THE CARUS MATHEMATICAL MONOGRAPHS

Published by
THE MATHEMATICAL ASSOCIATION OF AMERICA

Publication Committee
GILBERT AMES BLISS
DAVID RAYMOND CURTISS
HERBERT ELLSWORTH SLAUGHT

THE CARUS MATHEMATICAL MONOGRAPHS are an expression of the desire of Mrs. Mary Hegeler Carus, and of her son, Dr. Edward H. Carus, to contribute to the dissemination of mathematical knowledge by making accessible at nominal cost a series of expository presentations of the best thoughts and keenest researches in pure and applied mathematics. The publication of the first four of these monographs was made possible by a notable gift to the Mathematical Association of America by Mrs. Carus as sole trustee of the Edward C. Hegeler Trust Fund. The sales from these have resulted in the Carus Monograph Fund, and the Mathematical Association has used this as a revolving book fund to publish the fifth and sixth monographs.

The expositions of mathematical subjects which the monographs contain are set forth in a manner comprehensible not only to teachers and students specializing in mathematics, but also to scientific workers in other fields, and especially to the wide circle of thoughtful people who, having a moderate acquaintance with elementary mathematics, wish to extend their knowledge without prolonged and critical study of the mathematical journals and treatises. The scope of this series includes also historical and biographical monographs.

The following books in this series have been published to date:

No. 1. Calculus of Variations, by Gilbert Ames Bliss.
No. 2. Analytic Functions of a Complex Variable, by David Raymond Curtiss.
No. 3. Mathematical Statistics, by Henry Lewis Rietz.
No. 4. Projective Geometry, by John Wesley Young.
No. 5. A History of Mathematics in America Before 1900, by David Eugene Smith and Jekuthiel Ginsburg.
No. 6. Fourier Series and Orthogonal Polynomials, by Dunham Jackson.
No. 7. Vectors and Matrices, by C. C. MacDuffee.
The Carus Mathematical Monographs
NUMBER THREE

MATHEMATICAL STATISTICS

HENRY LEWIS RIETZ
Professor of Mathematics, The University of Iowa

Published for
THE MATHEMATICAL ASSOCIATION OF AMERICA
by
THE OPEN COURT PUBLISHING COMPANY
LA SALLE, ILLINOIS

Copyright 1927 by The Mathematical Association of America

Published April 1927
Second Printing 1929
Third Printing 1936
Fourth Printing 1943
Fifth Printing 1947

Reprinted by John S. Swift Co., Inc.
Chicago, St. Louis, Cincinnati, New York

PREFACE

This book on mathematical statistics is the third of the series of Carus Mathematical Monographs. The purpose of the monographs, admirably expressed by Professor Bliss in the first book of the series, is "to make the essential features of various mathematical theories more accessible and attractive to as many persons as possible who have an interest in mathematics but who may not be specialists in the particular theory presented."

The problem of making statistical theory available has been changed considerably during the past two or three years by the appearance of a large number of textbooks on statistical methods. In the course of preparation of the manuscript of the present volume, the writer felt at one time that perhaps the recent books had covered the ground in such a way as to accomplish the main purposes of the monograph which was in process of preparation. But further consideration gave support to the view that although the recent books on statistical method will serve useful purposes in the teaching and standardization of statistical practice, they have not, in general, gone far toward exposing the nature of the underlying theory, and some of them may even give misleading impressions as to the place and importance of probability theory in statistical analysis.
It thus appears that an exposition of certain essential features of the theory involved in statistical analysis would conform to the purposes of the Carus Mathematical Monographs, particularly if the exposition could be made interesting to the general mathematical reader. It is not the intention in the above remarks to imply a criticism of the books in question. These books serve certain useful purposes. In them the emphasis has been very properly placed on the use of devices which facilitate the description and analysis of data.

The present monograph will accomplish its main purpose if it makes a slight contribution toward shifting the emphasis and point of view in the study of statistics in the direction of the consideration of the underlying theory involved in certain highly important methods of statistical analysis, and if it introduces some of the recent advances in mathematical statistics to a wider range of readers. With this as our main purpose it is natural that no great effort is being made to present a well-balanced discussion of all the many available topics. This will be fairly obvious from omissions which will be noted in the following pages. For example, the very important elementary methods of description and analysis of data by purely graphic methods and by the use of various kinds of averages and measures of dispersion are for the most part omitted owing to the fact that these methods are so available in recent elementary books that it seems unnecessary to deal with them in this monograph. On the other hand, topics which suggest making the underlying theories more available are emphasized.

For the purpose of reaching a relatively large number of readers, we are fortunate in that considerable portions of the present monograph can be read by those who have relatively little knowledge of college mathematics.
However, the exposition is designed, in general, for readers of a certain degree of mathematical maturity, and presupposes an acquaintance with elementary differential and integral calculus, and with the elementary principles of probability as presented in various books on college algebra for freshmen.

A brief list of references is given at the end of Chapter VII. This is not a bibliography but simply includes books and papers to which attention has been directed in the course of the text by the use of superscripts.

The author desires to express his special indebtedness to Professor Burton H. Camp who read critically the entire manuscript and made many valuable suggestions that resulted in improvements. The author is also indebted to Professor A. R. Crathorne for suggestions on Chapter I and to Professor E. W. Chittenden for certain suggestions on Chapters II and III. Lastly, the author is deeply indebted to Professor Bliss and to Professor Curtiss of the Publication Committee for important criticisms and suggestions, many of which were made with special reference to the purposes of the Carus Mathematical Monographs.

Henry L. Rietz
The University of Iowa
December, 1926

TABLE OF CONTENTS

CHAPTER PAGE

I. The Nature of the Problems and Underlying Concepts of Mathematical Statistics .... 1
1. The scope of mathematical statistics
2. Historical remarks
3. Two general types of problems
4. Relative frequency and probability
5. Observed and theoretical frequency distributions
6. The arithmetic mean and mathematical expectation
7. The mode and the most probable value
8. Moments and the mathematical expectations of powers of a variable

II. Relative Frequencies in Simple Sampling .... 22
9. The binomial description of frequency
10. Mathematical expectation and standard deviation of the number of successes
11. Theorem of Bernoulli
12. The De Moivre-Laplace theorem
13. The quartile deviation
14. The law of small probabilities.
The Poisson exponential function

III. Frequency Functions of One Variable .... 46
15. Introduction
16. The Pearson system of generalized frequency curves
17. Generalized normal curves — Gram-Charlier series
18. Remarks on the genesis of Type A and Type B forms
19. The coefficients of the Type A series expressed in moments of the observed distribution
20. Remarks on two methods of determining the coefficients of the Type A series
21. The coefficients of the Type B series
22. Remarks
23. Skewness
24. Excess
25. Remarks on the distribution of certain transformed variates
26. Remarks on the use of various frequency functions as generating functions in a series representation

IV. Correlation .... 77
27. Meaning of simple correlation
28. The regression method and the correlation surface method of describing correlation
29. The correlation coefficient

THE REGRESSION METHOD OF DESCRIPTION
30. Linear regression
31. The standard deviation of arrays — mean square error of estimate
32. Non-linear regression — the correlation ratio
33. Multiple correlation
34. Partial correlation
35. Non-linear regression in n variables — multiple correlation ratio
36. Remarks on the place of probability in the regression method

THE CORRELATION SURFACE METHOD OF DESCRIPTION
37. Normal correlation surfaces
38. Certain properties of normally correlated distributions
39. Remarks on further methods of characterizing correlation

V. On Random Sampling Fluctuations .... 114
40. Introduction
41. Standard error and correlation of errors in class frequencies
42. Remarks on the assumptions involved in the derivation of standard errors
43. Standard error in the arithmetic mean and in a qth moment coefficient about a fixed point
44. Standard error of the qth moment μq about a mean
45. Remarks on the standard errors of various statistical constants
46. Standard error of the median
47.
Standard deviation of the sum of independent variables
48. Remarks on recent progress with sampling errors of certain averages obtained from small samples
49. The recent generalizations of the Bienaymé-Tchebycheff criterion
50. Remarks on the sampling fluctuations of an observed frequency distribution from the underlying theoretical distribution

VI. The Lexis Theory .... 146
51. Introduction
52. Poisson series
53. Lexis series
54. The Lexis ratio

VII. A Development of the Gram-Charlier Series .... 156
55. Introduction
56. On a development of Type A and Type B from the law of repeated trials
57. The values of the coefficients of the Type A series obtained from the biorthogonal property
58. The values of the coefficients of Type A series obtained from a least-squares criterion
59. The coefficients of a Type B series

Notes .... 173
Index .... 178

CHAPTER I

THE NATURE OF THE PROBLEMS AND UNDERLYING CONCEPTS OF MATHEMATICAL STATISTICS

1. The scope of mathematical statistics. The bounds of mathematical statistics are not sharply defined. It is not uncommon to include under mathematical statistics such topics as interpolation theory, approximate integration, periodogram analysis, index numbers, actuarial theory, and various other topics from the calculus of observations. In fact, it seems that mathematical statistics in its most extended meaning may be regarded as including all the mathematics applied to the analysis of quantitative data obtained from observation. On the other hand, a number of mathematicians and statisticians have implied by their writings a limitation of mathematical statistics to the consideration of such questions of frequency, probability, averages, mathematical expectation, and dispersion as are likely to arise in the characterization and analysis of masses of quantitative data.
Borel has expressed this somewhat restricted point of view in his statement¹ that the general problem of mathematical statistics is to determine a system of drawings carried out with urns of fixed composition, in such a way that the results of a series of drawings lead, with a very high degree of probability, to a table of values identical with the table of observed values.

* For footnote references, see pp. 173-77.

On account of the different views concerning the boundaries of the field of mathematical statistics there arose early in the preparation of this monograph questions of some difficulty in the selection of topics to be included. Although no attempt will be made here to answer the question as to the appropriate boundaries of the field for all purposes, nevertheless it will be convenient, partly because of limitations of space, to adopt a somewhat restricted view with respect to the topics to be included. To be more specific, the exposition of mathematical statistics here given will be limited to certain methods and theories which, in their inception, center around the names of Bernoulli, De Moivre, Laplace, Lexis, Tchebycheff, Gram, Pearson, Edgeworth, and Charlier, and which have been much developed by other contributors. These methods and theories are much concerned with such concepts as frequency, probability, averages, mathematical expectation, dispersion, and correlation.

2. Historical remarks. While we are currently experiencing a period of special activity in mathematical statistics which dates back only about forty years, some of the concepts of mathematical statistics are by no means of recent origin. The word "statistics" is itself a comparatively new word, as shown by the fact that its first occurrence in English thus far noted seems to have been in J. F. von Bielfeld, The Elements of Universal Erudition, translated by W. Hooper, London, 1770.
Notwithstanding the comparatively recent introduction of the word, certain fundamental concepts of mathematical statistics to which attention is directed in this monograph date back to the first publication relating to Bernoulli's theorem in 1713. The line of development started by Bernoulli was carried forward by Stirling (1730), De Moivre (1733), Euler (1738), and Maclaurin (1742), and culminated in the formulation of the probability theory of Laplace. The Théorie Analytique des Probabilités of Laplace, published in 1812, is the most significant publication underlying mathematical statistics. For a period of approximately fifty years following the publication of this monumental work there was relatively little of importance contributed to the subject. While we should not overlook Poisson's extension of the Bernoulli theory to cases where the probability is not constant, Gauss's development of methods for the adjustment of observations, Bravais's extension of the normal law to functions of two and three variables, and Quetelet's activities as a popularizer of social statistics, nevertheless there was on the whole in this period of fifty years little progress.

The lack of progress in this period may be attributed to at least three factors: (1) Laplace left many of his results in the form of approximations that would not readily form the basis for further development; (2) the followers of Gauss retarded progress in the generalization of frequency theory by overpromoting the idea that deviations from the normal law of frequency are due to lack of data; (3) Quetelet overpopularized the idea of the stability of certain striking forms of social statistics, for example, the stability of the number of suicides per year, with the natural result that his activities cast upon statistics a suspicion of quackery which exists even to some extent at present.
An important step in advance was taken in 1877 in the publication of the contributions of Lexis to the classification of statistical distributions with respect to normal, supernormal, and subnormal dispersion. This theory will receive attention in the present monograph.

The development of generalized frequency curves and the contributions to a theory of correlation from 1885 to 1900 started the period of activity in mathematical statistics in which we find ourselves at present. The present monograph deals largely with the progress in this period, and with the earlier underlying theory which facilitated relatively recent progress.

3. Two general types of problems. For purposes of description it seems convenient to recognize two general classes of problems with which we are concerned in mathematical statistics. In the problems of the first class our concern is largely with the characterization of a set of numerical measurements or estimates of some attribute or attributes of a given set of individuals. For example, we may establish the facts about the heights of 1,000 men by finding averages, measures of dispersion, and various statistical indexes. Our problem may be limited to a characterization of the heights of these 1,000 men.

In the problems of the second class we regard the data obtained from observation and measurement as a random sample drawn from a well-defined class of items which may include either a limited or an unlimited supply. Such a well-defined class of items may be called the "population" or universe of discourse. We are in this case concerned with using the properties of a random sample of variates for the purpose of drawing inferences about the larger population from which the sample was drawn.
For example, in this class of problems involving the heights of the 1,000 men we would be concerned with the question: What approximate or probable inferences may be drawn about the statures of a whole race of men from an analysis of the heights of a sample of 1,000 men drawn at random from the men of the race? In dealing with such questions, we should in the first place consider the difficulties involved in drawing a sample that is truly random, and in the next place the problem of developing certain parts of the theory of probability involved in statistical inference.

The two classes of problems to which we have directed attention are not, however, entirely distinct with regard to their treatment. For example, the conceptions of probable and standard error may be used both in describing the facts about a sample and in indicating the probable degree of precision of inferences which go beyond the observed sample by dealing with certain properties of the population from which we conceive the sample to be drawn. Moreover, a satisfactory description of a sample is not likely to be so purely descriptive as wholly to prevent the mind from dwelling on the inner meaning of the facts in relation to the population from which the sample is drawn.

As a preliminary to dealing in later chapters with certain of the problems falling under these two general classes, we shall attempt in the present chapter to discuss briefly the nature of certain underlying concepts. We shall find it convenient to consider these concepts in pairs as follows: relative frequency and probability; observed and theoretical frequency distributions; arithmetic mean and mathematical expectation; mode and most probable value; moments and mathematical expectations of a power of a variable.

4. Relative frequency and probability.
The frequency f of the occurrence of a character or event among s possible occurrences is one of the simplest items of statistical information. For example, any one of the following items illustrates such statistical information: Five deaths in a year among 1,000 persons aged 30, nearest birthday; 610 boys among the last 1,200 children born in a city; 400 married men out of a total of 1,000 men of age 23; twelve cases of 7 heads in throwing 7 coins 1,536 times.

The determination of the numerical values of the relative frequencies f/s corresponding to such items is one of the simplest problems of statistics. This simple problem suggests a fundamental problem concerning the probable or expected values of such relative frequencies if s were a very large number. When s is a large number, the relative frequency f/s is very commonly accepted in applied statistics as an approximate measure of the probability of occurrence of the event or character on a given occasion.

To take an illustration from an important statistical problem, let us assume that among l persons equally likely to live a year we find d observed deaths during the year. That is, we assume that d represents the frequency of deaths per year among the l persons each exposed for one year to the hazards of death. If l is fairly large, the relative frequency d/l is often regarded as an approximation to what is to be defined as the probability of death of one such person within a year. In fact, it is a fundamental assumption of actuarial science that we may regard such a relative frequency as an approximation to the probability of death when a sufficiently large number of persons are exposed to the hazards of death. For a numerical illustration, suppose there are 600 deaths among 100,000 persons exposed for a year at age 30. We accept .006 as an approximation to the probability in question at age 30.
In the method of finding such an approximation we decide on a population which constitutes an appropriate class for investigation and in which individuals satisfy certain conditions as to likeness. Then we depend on observation to obtain the items which lead to the relative frequency which we may regard as an approximation to the probability.

For an ideal population, let us conceive an urn containing white and black balls, alike except as to color and thoroughly mixed. Suppose further for the present that we do not know the ratio of the number of white balls to the total number in this urn, which we may conceive to contain either any finite number or an indefinitely large number of balls. This ratio is often called the probability of drawing a white ball. When the number in the urn is finite, we make drawings at random consisting of s balls taken one at a time with replacements to keep the ratio of the numbers of white and black balls constant. If we may assume the number in the urn to be infinite, the drawings may under certain conditions be made without replacements. Suppose we obtain f white balls as a result of thus drawing s balls; then we say that f/s is the relative frequency with which we drew white balls. When s is large, this relative frequency would ordinarily give us an approximate value of the probability of drawing a white ball in one trial, that is, an approximate value of the ratio of white balls to the total number of balls in the urn.

Thus far we have not defined probability, but have presented illustrations of approximations to probabilities. While these illustrations seem to suggest a definition, it is nevertheless difficult to frame a definition that is satisfactory and includes all forms of probability.
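The urn experiment just described is easy to imitate numerically. The following sketch (a modern illustration added to the text, not part of the original; the urn composition and the function name are chosen for the example) draws s balls one at a time with replacement and reports the relative frequency f/s of white balls:

```python
import random

def relative_frequency_of_white(white, black, s, seed=0):
    """Draw s balls one at a time, with replacement, from an urn of
    `white` white and `black` black balls; return the relative
    frequency f/s with which white balls were drawn."""
    rng = random.Random(seed)
    total = white + black
    f = sum(1 for _ in range(s) if rng.randrange(total) < white)
    return f / s

# When s is large, f/s approximates the ratio of white balls
# to the total number of balls in the urn (here 3/10).
print(relative_frequency_of_white(white=3, black=7, s=10000))
```

Drawing with replacement keeps the urn's composition, and hence the ratio, constant from trial to trial, exactly as the text requires for the finite urn.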
The need for the concepts of relative frequency and probability in statistics arises when we are associating two events such that the first may be regarded as a trial and the second may be regarded as a success or a failure depending on the result of the trial. The relative frequency of success is then the ratio of the number of successes to the total number of trials. If the relative frequency of success approaches a limit when the trial is repeated indefinitely under the same set of circumstances, this limit is called the probability of success in one trial.

There are some objections to this definition of probability as well as to any other that we could propose. One objection is concerned with questioning the validity of the assumption that a limit of the relative frequency exists, and another relates to the meaning of the expression, "the same set of circumstances." That the limit exists is an empirical assumption whose validity cannot be proved, but experience with data in many fields has given much support to the reasonableness and usefulness of the assumption. The objection based on the difficulty of controlling conditions so as to repeat the trial under the same set of circumstances is an objection that could be brought against experimental science in general with respect to the difficulties of repeating experiments under the same circumstances. The experiments are repeated as nearly as circumstance permits.

It seems fairly obvious that the development of statistical concepts is approached more naturally from this limit definition than from the familiar definitions suggested by games of chance. However, we shall at certain points in our treatment (for example, see § 11) give attention to the fact that various definitions of probability exist in which the assumptions differ from those involved in the above definition.
The meaning of probability in statistics is fairly well expressed for some purposes by any one of the expressions: theoretical relative frequency, presumptive relative frequency, or expected value of a relative frequency. Indeed, we sometimes express the fact that the relative frequency f/s is assumed to have the probability p as a limit when s → ∞ in abbreviated form by writing E(f/s) = p, where E(f/s) is read "expected value of f/s." It is fairly clear that in our definition of probability we simply idealize actual experience by assuming the existence of a limit of the relative frequency. This idealization, for purposes of definition, is in some respects analogous to the idealization of the chalk mark into the straight line of geometry.

In certain cases, notably in games of chance or urn schemata, the probability may be obtained without collecting statistical data on frequencies. Such cases arise when we have urn schemata of which we know the ratio of the number of white balls to the total number. For example, suppose an urn contains 7 white and 3 black balls and that we are to inquire into the probability that a ball to be drawn will be white. We could experiment by drawing one ball at a time with replacements until we had made a very large number of drawings and then estimate the probability from the ratio of the number of white balls to the total number of balls drawn. It would however in this case ordinarily be much more convenient and satisfying to examine the balls to note that they are alike except as to color, and then make certain assumptions that would give us the probability without actually making the trials. Thus, when all the possible ways of drawing the balls one at a time may be analyzed into 10 equally likely ways, and when 7 of these 10 ways give white balls, we assume that 7/10 is the probability that the ball to be drawn in one trial will be white.
This simple case illustrates the following process of arriving at a probability: If all of an aggregate of ways of obtaining successes and failures can be analyzed into s′ possible mutually exclusive ways each of which is equally likely, and if f′ of these ways give successes, the probability of a success in a single trial may be taken to be f′/s′.

Thus in throwing a single die, what is the probability of obtaining an ace? We assume that there are 6 equally likely ways in which the die may fall. One of these ways gives an ace. Hence, we say 1/6 is the probability of throwing an ace. A probability whose value is thus obtained from an analysis of ways of occurrence into sets of equally likely cases and a segregation of the cases in which a success would occur is sometimes called an a priori probability, while a probability whose approximate value is obtained from actual statistical data on repeated trials is called an a posteriori or statistical probability.

In making an analysis to study probabilities, difficult questions arise both as to the meaning and fulfilment of the condition that the ways are to be "equally likely." These questions have been the subject of lively debates by mathematicians and philosophers since the time of Laplace. It has been fairly obvious that the expression "equally likely ways" implies as a necessary condition that we have no information leading us to expect the event to occur in one of two ways rather than in the other, but serious doubt very naturally arises as to the sufficiency of this condition. In fact, it is fairly clear that lack of information is not sufficient. For example, lack of information as to whether a spinning coin is symmetrical and homogeneous does not assist one in passing on the validity of the assumption that it is equally likely to turn up head or tail.
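The rule f′/s′ amounts to listing the equally likely cases and counting those that yield a success. A minimal sketch of that counting process (the helper name is ours, not Rietz's; the die is assumed fair, as in the text):

```python
from fractions import Fraction

def a_priori_probability(cases, is_success):
    """Apply the rule f'/s': of len(cases) equally likely, mutually
    exclusive ways, count those yielding a success and divide."""
    f_prime = sum(1 for case in cases if is_success(case))
    return Fraction(f_prime, len(cases))

# One throw of a die: 6 equally likely faces, one of which is an ace.
p_ace = a_priori_probability(range(1, 7), lambda face: face == 1)
print(p_ace)  # 1/6
```

The same helper counts any event over the six faces, e.g. an even face gives 3/6 = 1/2.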
It is when we have all available relevant information on such matters as symmetry and homogeneity that we have a basis for the inference that the two ways are equally likely, or not equally likely. Similarly, lack of information about two large groups of men of age 30 would not assist us in making the inference that the mortality rates or probabilities of death are approximately equal for the two groups. On the other hand, relevant information in regard to the results of recent medical examinations, occupations, habits, and family histories would give support to certain inferences or assumptions concerning the equality or inequality of the mortality rates for the two groups.

5. Observed and theoretical frequency distributions. In many statistical investigations, it is convenient to partition the whole group of observations into subgroups or classes so as to show the number or frequency of observations in each class. Such an exhibit of observations is called an "observed frequency distribution." As illustrations we present the following, where the rows marked F are the observed frequency distributions:

Example 1. A = lengths of ears of corn in inches.
A.... 3.0 4.5 6.0 7.5 9.0 10.5 12.0
F.... 1 3 20 63 170 67 3

Example 2. A = prices of commodities for 1919 relative to price of 1913 as a base.
A.... 62 87 112 137 162 187 212 237 262 287 312 337 362 387 412 437 462
F.... 1 1 5 16 39 66 61 36 38 24 9 3 3 3 1

Example 3. A = heights of men in inches.
A.... 61 62 63 64 65 66 67 68 69 70 71 72 73 74
F.... 2 10 11 38 57 93 106 126 109 87 75 23 9 4

In Example 1 the whole group of ears of corn is arranged in classes with respect to length of ears. The class interval in this case is taken to be one and one-half inches. In Example 2 the class interval is a number, twenty-five; in Example 3, it is one inch.

If the variable x takes values x1, x2, ...., xn with the corresponding probabilities p1, p2, ....
, pn, we call the system of values x1, x2, ...., xn and the associated probabilities, or numbers proportional to them, the theoretical frequency distribution of the variable x. Thus, we may write for the theoretical frequency distribution of the number of heads in throwing three coins:

Heads..................... 0 1 2 3
Probabilities............. 1/8 3/8 3/8 1/8
Theoretical frequencies... 1 3 3 1

When for a given set of values of a variable x there exists a function F(x) such that the ratio of the number of values of x on the interval ab to the number on the interval a'b' is the ratio of the integrals

∫_a^b F(x) dx : ∫_{a'}^{b'} F(x) dx ,

for all choices of the intervals ab and a'b', then F(x) is called the frequency function, or the probability density, or the law of distribution of the values of x. The curve y = F(x) is called a theoretical frequency curve, or more briefly the frequency curve.

To devise methods for the description and characterization of the various types of frequency distributions which occur in practical problems of statistics is clearly of fundamental importance.

[Fig. 1. Frequency polygon and free-hand frequency curve of the distribution of heights of men in Example 3.]

Such a description or characterization may be effected with various degrees of refinement, ranging all the way from one extreme with a simple frequency polygon or freehand curve (Fig. 1) representing frequencies by ordinates, to a description at the other extreme by means of a theoretical frequency curve grounded in the theory of probability.
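The three-coin table above can be recovered by enumerating the 2³ equally likely sequences of heads and tails and tallying the number of heads in each. A short sketch of that enumeration (an added illustration using only the standard library):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely head(1)/tail(0) sequences for
# three coins and tally the number of heads in each.
theoretical_frequencies = Counter(sum(seq) for seq in product((0, 1), repeat=3))
probabilities = {heads: Fraction(count, 8)
                 for heads, count in sorted(theoretical_frequencies.items())}

# Theoretical frequencies 1, 3, 3, 1 and probabilities 1/8, 3/8, 3/8, 1/8.
print(dict(sorted(theoretical_frequencies.items())), probabilities)
```

The frequencies 1, 3, 3, 1 are the binomial coefficients of (1 + 1)³, which is the connection to the binomial description of frequency taken up in Chapter II.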
It is fairly obvious that the latter type of description is likely to be much more satisfactory than the former, because a deeper meaning is surely given to an observed distribution if we can effectively describe it by means of a theoretical frequency curve than if we can give only a freehand or an empirical curve as the approximate representation. However, we should not overlook the fact that the description by means of a theoretical curve may be too ponderous and laborious for the particular purpose of an analysis. Indeed, the use of the theoretical curve is likely to be justified in a large way only when it facilitates the study of the properties of the class of distributions of which the given one is a random sample, by enabling us to make use of the properties of a mathematical function $F(x)$ in establishing certain theoretical norms for the description of a class of actual distributions. As important supplements to the purely graphic method, we may describe the frequency distribution by the use of averages, measures of dispersion, skewness, and peakedness. Such descriptions facilitate the comparison of one distribution with another with respect to certain features.

6. The arithmetic mean and mathematical expectation. The arithmetic mean (AM) of $n$ numbers is simply the sum of the numbers divided by $n$. That is, the arithmetic mean of the numbers $x_1, x_2, \ldots, x_n$ is given by the formula

$$(1)\qquad AM = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$

The AM is thus what is usually meant by the terms "mean," "average," or "mean value" when used without further qualification. If the values $x_1, x_2, \ldots, x_n$ occur with corresponding frequencies $f_1, f_2, \ldots, f_n$, respectively, where $f_1 + f_2 + \cdots$
$+ f_n = s$, then it follows from (1) that the arithmetic mean is given by

$$(2)\qquad AM = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{s}$$

$$(3)\qquad\phantom{AM} = \frac{f_1}{s}\,x_1 + \frac{f_2}{s}\,x_2 + \cdots + \frac{f_n}{s}\,x_n,\qquad\text{where}\quad \frac{f_1}{s} + \frac{f_2}{s} + \cdots + \frac{f_n}{s} = 1.$$

The arithmetic mean given by (2) is sometimes called a "weighted arithmetic mean," where $f_1, f_2, \ldots, f_n$ are the weights of the values $x_1, x_2, \ldots, x_n$, respectively; and (3) may similarly be regarded as a weighted arithmetic mean, where $f_1/s, f_2/s, \ldots, f_n/s$ are the weights of $x_1, x_2, \ldots, x_n$, respectively.

For our present purpose it is important to note that the coefficients of $x_1, x_2, \ldots, x_n$ in (3) are the relative frequencies of occurrence of these values. By definition of statistical probabilities, the limiting value of $f_t/s$ as $s$ increases indefinitely is $p_t$, where $p_t$ is the assumed probability of the occurrence of a value $x_t$ among a set of mutually exclusive values $x_1, x_2, \ldots, x_n$. Hence, as the number of cases considered becomes infinite, the arithmetic mean would approach a value given by

$$(4)\qquad AM = p_1 x_1 + p_2 x_2 + \cdots + p_n x_n,$$

where the probabilities $p_1, p_2, \ldots, p_n$ may be regarded as the weights of the corresponding values.

The mathematical expectation of the experimenter, or the expected value of the variable, is a concept that has been much used by various continental European writers on mathematical statistics. Suppose we consider the probabilities $p_1, p_2, \ldots, p_n$ of $n$ mutually exclusive events $E_1, E_2, \ldots, E_n$, so that $p_1 + p_2 + \cdots + p_n = 1$. Suppose that the occurrence of one of these, say $E_t$, on a given occasion yields a value $x_t$ of a variable $x$. Then the mathematical expectation or expected value $E(x)$ of the variable $x$ which takes on the values $x_1, x_2, \ldots, x_n$ with the probabilities $p_1, p_2, \ldots, p_n$, respectively, may be defined as

$$(5)\qquad E(x) = p_1 x_1 + p_2 x_2 + \cdots + p_n x_n.$$

We thus note by a comparison of (4) and (5) the identity of the limit of the mean value and the mathematical expectation.
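Formulas (2), (3), and (5) can be checked against one another numerically. A minimal sketch, reusing the height data of Example 3 (the variable names are ours, not the text's):

```python
# Weighted arithmetic mean, formulas (2)-(3), and the expected value (5)
# with the relative frequencies f_t/s playing the role of the p_t.
xs = [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]
fs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

s = sum(fs)
am = sum(f * x for f, x in zip(fs, xs)) / s            # formula (2)
weights = [f / s for f in fs]                          # relative frequencies
expectation = sum(w * x for w, x in zip(weights, xs))  # formula (5), p_t = f_t/s

assert abs(am - expectation) < 1e-12
print(round(am, 3))    # 67.9 inches
```

The two computations agree exactly, since (3) is merely (2) with the division by $s$ distributed over the terms.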
Furthermore, in dealing with a theoretical distribution in which $p_t$ is the probability that a variable $x$ assumes a value $x_t$ among the possible mutually exclusive values $x_1, x_2, \ldots, x_n$, with $p_1 + p_2 + \cdots + p_n = 1$, we have

$$(6)\qquad AM = p_1 x_1 + p_2 x_2 + \cdots + p_n x_n.$$

That is, the mathematical expectation of a variable $x$ and its mean value from the appropriate theoretical distribution are identical. While there are probably differences of opinion as to the relative merits of the language involving mathematical expectation or expected value in comparison with the language which uses the mean value of a theoretical distribution, or the mean value as the number of cases becomes infinite, the language of expectation seems the more elegant in many theoretical discussions. For the discussions in the present monograph we shall employ both of these types of language.

7. The mode and the most probable value. The mode or modal value of a variable is that value which occurs most frequently (that is, is most fashionable), if such a value exists.

Rough approximations to the mode are used considerably in general discourse. To illustrate, the meaning of the term "average" as frequently used in the newspapers in speaking of the average man seems to be a sort of crude approximation to the mode. That is, the term "average" in this connection usually implies a type which occurs oftener than any other single type.

The mode presents one of the most striking characteristics of a frequency distribution. For example, consider the frequency distribution of ears of corn with respect to rows of kernels on ears as given in the following table, where A = number of rows of kernels and F = frequency:

A....  10  12  14   16   18   20   22  24
F....  1   16  109  241  235  116  41  10

It may be noted that the frequency increases up to the class with 16 rows and then decreases.
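Finding the mode of such a tabulated distribution amounts to picking the class of greatest frequency; a one-line sketch over the corn-row table above:

```python
# The corn-row distribution: the mode is the class of greatest
# frequency, here the ears with 16 rows of kernels (F = 241).
rows  = [10, 12, 14, 16, 18, 20, 22, 24]
freqs = [1, 16, 109, 241, 235, 116, 41, 10]

mode = max(zip(freqs, rows))[1]
print(mode)    # 16
```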
The mode in relation to a frequency distribution is a value to which there corresponds a greater frequency than to values just preceding or immediately following it in the arrangement. That is, the mode is the value of the variable for which the frequency is a maximum. A distribution may have more than one maximum, but the most common types of frequency distributions of both theoretical and practical interest in statistics will be found to have only one mode.

The expression "most probable value" of the number of successes in $s$ trials is used in the general theory of probability for the number to which corresponds a larger probability of occurrence than to any other single number which can be named. For example, in throwing 100 coins, the most probable number of heads is 50, because 50 is more likely than any other single number. This does not mean, however, that the probability of throwing exactly 50 heads is large. In fact, it is small, but nevertheless greater than the probability of throwing 49 or any other single number of heads. In other words, the most probable value is the modal value of the appropriate theoretical distribution.

8. Moments and the mathematical expectations of powers of a variable. With observed frequencies $f_1, f_2, \ldots, f_n$ corresponding to $x_1, x_2, \ldots, x_n$, respectively, and with $f_1 + f_2 + \cdots + f_n = s$, the $k$th order moment, per unit frequency, is defined as

$$(7)\qquad \mu_k' = \frac{1}{s}\sum_{t=1}^{n} f_t x_t^k,$$

which is the arithmetic mean of the $k$th powers of the variates. For the sake of brevity, we shall ordinarily use the word "moment" as an abbreviation for "moment per unit frequency" when this usage will lead to no misunderstanding of the meaning.

Consider a theoretical distribution of a variable $x$ taking values $x_t\ (t = 1, 2, \ldots, n)$. Let the corresponding probabilities of occurrence $p_t\ (t = 1, 2, \ldots, n)$ be represented as $y$-ordinates.
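The 100-coin remark in §7 above can be verified directly. A short sketch (the helper name `heads_prob` is ours, not the text's):

```python
from math import comb

# Probability of exactly m heads in s throws of fair coins: C(s, m) / 2**s.
def heads_prob(m, s=100):
    return comb(s, m) / 2 ** s

p50 = heads_prob(50)
p49 = heads_prob(49)

assert p50 > p49
assert all(heads_prob(m) <= p50 for m in range(101))
print(round(p50, 4))    # about 0.0796: small in absolute terms, yet maximal
```

The probability of exactly 50 heads is thus under 8 per cent, and nevertheless exceeds that of any other single count, exactly as the text asserts.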
Then the moment of order $k$ of the ordinates about the $y$-axis is defined as

$$(8)\qquad \mu_k' = \sum_{t=1}^{n} p_t x_t^k.$$

The mathematical expectation of the $k$th power of $x$ is likewise defined as the second member of this equality, so that the $k$th moment of the theoretical distribution and the mathematical expectation of the $k$th power of the variable $x$ are identical. When we have a theoretical distribution ranging from $x = a$ to $x = b$, and given by a frequency function (p. 13) $y = F(x)$, we write in place of (8)

$$\mu_k' = \int_a^b x^k F(x)\,dx,$$

where $F(x)\,dx$ gives, to within infinitesimals of higher order, the probability that a value of $x$ taken at random falls in any assigned interval $x$ to $x + dx$.

When the axis of moments is parallel to the $y$-axis and passes through the arithmetic mean or centroid $\bar{x}$ of the variable $x$, the primes will be dropped from the $\mu$'s which denote the moments. Thus, we write

$$(9)\qquad \mu_k = \frac{1}{s}\sum_{t=1}^{n} f_t (x_t - \bar{x})^k = \sum_{t=1}^{n} p_t (x_t - \mu_1')^k,$$

where the arithmetic mean of the values of $x$ is $\bar{x} = \mu_1'$.

The square root of the second moment $\mu_2$ about the arithmetic mean is called the standard deviation and is very commonly denoted by $\sigma$. That is, the standard deviation is the root-mean-square of the deviations of a set of numbers from their arithmetic mean. In the language of mechanics, $\sigma$ is the radius of gyration of a set of $s$ equal particles with respect to a given centroidal axis.

It is often important to be able to compute the moments about the axis through the centroid from those about an arbitrary parallel axis. For this purpose the following relations are easily established by expanding the binomial in (9) and then making some slight simplifications:

$$\mu_0 = \mu_0' = 1,\qquad \mu_1 = 0,\qquad \mu_2 = \mu_2' - \mu_1'^2,$$

$$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2\mu_1'^3,\qquad \mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6\mu_1'^2\mu_2' - 3\mu_1'^4,$$

and, in general,

$$\mu_k = \sum_{i=0}^{k} \binom{k}{i}(-\mu_1')^i\,\mu_{k-i}',$$

where

$$\binom{n}{i} = \frac{n!}{i!\,(n-i)!}$$

is the number of combinations of $n$ things taken $i$ at a time.

These relations are very useful in certain problems of practical statistics, because the moments $\mu_k'\ (k = 1, 2, \ldots)$
are ordinarily computed first about an axis conveniently chosen, and then the moments $\mu_k$ about the parallel centroidal axis may be found by means of the above relations. In particular, $\mu_2 = \mu_2' - \mu_1'^2$ expresses the very important relation that the second moment $\mu_2$ about the arithmetic mean is equal to the second moment $\mu_2'$ about an arbitrary origin diminished by the square $\mu_1'^2$ of the arithmetic mean measured from the arbitrary origin. This is a familiar proposition of elementary mechanics when the mean is replaced by the centroid. When we pass from (9) to corresponding expectations, the relation $\mu_2 = \mu_2' - \mu_1'^2$, written in the form $\mu_2' = \mu_1'^2 + \mu_2$, tells us that the expected value $E(x^2)$ of $x^2$ is equal to the square $[E(x)]^2$ of the expected value of $x$ increased by the expected value $E\{[x - E(x)]^2\}$ of the square of the deviations of $x$ from its expected value.

CHAPTER II

RELATIVE FREQUENCIES IN SIMPLE SAMPLING

9. The binomial description of frequency. In Chapter I attention was directed to the very simple process of finding the relative frequency of occurrence of an event or character among $s$ cases in question. Let us now conceive of repeating the process of finding relative frequencies on many random samples, each consisting of $s$ items drawn from the same population. To characterize the degree of stability or the degree of dispersion of such a series of relative frequencies is a fundamental statistical problem.

To illustrate, suppose we repeat the throwing of a set of 1,000 coins many times. An observed frequency distribution could then be exhibited with respect to the number of heads obtained in each set of 1,000, or with respect to the relative frequency of heads in sets of 1,000. Such a procedure would be a laborious experimental treatment of the problem of the distribution of relative frequencies from repeated trials.
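The laborious experiment just described can at least be mimicked on a reduced scale. The following illustrative sketch (ours, not the text's) throws a set of 1,000 coins 200 times and tabulates the number of heads in each set:

```python
import random
from collections import Counter

random.seed(7)

# Throw a set of 1,000 coins repeatedly and record the number of heads
# obtained in each set, as in the experiment described in section 9.
def heads_in_set(s=1000):
    return sum(random.random() < 0.5 for _ in range(s))

counts = [heads_in_set() for _ in range(200)]
distribution = Counter(counts)

print(min(counts), max(counts))    # the counts cluster closely about 500
```

The observed counts crowd near 500, foreshadowing the theoretical description of this clustering that the binomial expansion supplies below.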
What we seek is a mathematical method of obtaining the theoretical frequency distribution with respect to the number of heads or with respect to the relative frequency of heads in the sets.

To consider a more general problem, suppose we draw many sets of $s$ balls from an urn one at a time with replacements, and let $p$ be the probability of success in drawing a white ball in one trial. The problem we set is to determine the theoretical frequency distribution with respect to the number of white balls per set of $s$, or with respect to the relative frequency of white balls in the sets. To consider this problem, let $q$ be the probability of failure to draw a white ball in one trial, so that $p + q = 1$. Then the probabilities of exactly $m = 0, 1, 2, \ldots, s$ successes in $s$ trials are given by the successive terms of the binomial expansion

$$(1)\qquad (q + p)^s = q^s + s\,p\,q^{s-1} + \cdots + \binom{s}{m} p^m q^{s-m} + \cdots + p^s,$$

where

$$\binom{s}{m} = \binom{s}{s-m} = \frac{s!}{m!\,(s-m)!}.$$

Derivations of this formula for the probability of $m$ successes in $s$ trials from certain definitions of probability are given in books on college algebra for freshmen. For a derivation starting from the definition of probability as a limit, the reader is referred to Coolidge.*

A frequency distribution with class frequencies proportional to the terms of (1) is sometimes called a Bernoulli distribution. Such a theoretical distribution shows not only the most probable distribution of the drawings from an urn, as described above, but it serves also as a norm for the distribution of relative frequencies obtained from some of the simplest sampling operations in applied statistics. For example, the geneticist may regard the Bernoulli distribution (1) as the theoretical distribution of the relative frequencies $m/s$ of green peas which he would obtain among random samples each consisting of a yield of $s$ peas.

* See references on pp. 173–77.
The biologist may regard (1) as the theoretical distribution of the relative frequencies of male births in random samples of $s$ births. The actuary may regard (1) as the theoretical distribution of yearly death-rates in samples of $s$ men of equal ages, say of age 30, drawn from a carefully described class of men. In this case we specify that the samples shall be taken from a carefully described class of men because the assumptions involved in the urn schemata underlying a Bernoulli distribution do not permit a careless selection of data. Thus, it would not be in accord with the assumptions to take some of the samples from a group of teachers with a relatively low rate of mortality and others from a group of anthracite coal miners with a relatively high rate of mortality.

The fact stated at the beginning of this section, that we are concerned with repeating the process of drawing from the same population, is intended to imply that the same set of circumstances essential to drawing a random sample shall exist throughout the whole series of drawings. The expression "simple sampling" is sometimes applied to drawing a random sample when the conditions for repetition just described are fulfilled. In other words, simple sampling implies that we may assume the underlying probability $p$ of formula (1) remains constant from sample to sample, and that the drawings are mutually independent in the sense that the results of drawings do not depend in any significant manner on what has happened in previous drawings.

In Figure 2 the ordinates at $x = 0, 1, 2, \ldots, 7$ show the values of the terms of (1) for $p = q = 1/2$, $s = 7$.

To find the "most probable" or modal number of successes $m'$ in $s$ trials, we seek the value of $m = m'$ which gives a maximum term of (1). To find this value of $m$, we write the ratios of the general term of (1) to the preceding and the succeeding terms.
The first ratio will be equal to or greater than unity when

$$\frac{s - m + 1}{m}\cdot\frac{p}{q} \ge 1,\qquad\text{or}\qquad m \le ps + p.$$

In the same way, the second ratio will be equal to or greater than unity when

$$\frac{m + 1}{s - m}\cdot\frac{q}{p} \ge 1,\qquad\text{or}\qquad m \ge ps - q.$$

We have, thus, the integer $m = m'$ which gives the modal value determined by the inequalities

$$ps - q \le m' \le ps + p.$$

We may say, therefore, that, neglecting a proper fraction, $ps$ is the most probable or modal number of successes. When $ps - q$ and $ps + p$ are integers, there occur two equal terms in (1), each of which is larger than any other term of the series. For example, note the equality of the first and second terms of the expansion $(5/6 + 1/6)^5$.

10. Mathematical expectation and standard deviation of the number of successes. Let $\bar{m}$ be the mathematical expectation of the number of successes, or, what is the same thing, the arithmetic mean number of successes in $s$ trials under the law of repeated trials as defined by formula (1). We shall now prove that $\bar{m} = ps$. By definition (§6),

$$(2)\qquad \bar{m} = \sum_{m=0}^{s} m\,\frac{s!}{m!\,(s-m)!}\,p^m q^{s-m}$$

$$(3)\qquad\phantom{\bar{m}} = sp\sum_{m=1}^{s} \frac{(s-1)!}{(m-1)!\,(s-m)!}\,p^{m-1} q^{s-m} = sp\,,$$

since the last sum is the expansion of $(q + p)^{s-1} = 1$.

Let $d = m - sp$ be the discrepancy of the number of successes from the mathematical expectation, and let $\sigma^2$ be the mathematical expectation of the square of the discrepancy. By definition,

$$(4)\qquad \sigma^2 = \sum_{m=0}^{s} \frac{s!}{m!\,(s-m)!}\,p^m q^{s-m}\,(m - sp)^2 = \sum_{m=0}^{s} \frac{s!}{m!\,(s-m)!}\,p^m q^{s-m}\,(m^2 - 2msp + s^2p^2)\,.$$

We shall now prove that $\sigma^2 = spq$. To do this, we write $m^2 = m + m(m-1)$ and obtain for the first term of (4) the value

$$(5)\qquad \sum_{m=0}^{s} m^2\,\frac{s!}{m!\,(s-m)!}\,p^m q^{s-m} = sp + s(s-1)p^2\,.$$

From (2), (3), (4), and (5), we have

$$(6)\qquad \sigma^2 = sp + s(s-1)p^2 - 2s^2p^2 + s^2p^2 = sp(1-p) = spq\,.$$

The measure of dispersion $\sigma$ is often called the standard deviation of the frequency of successes in the population. Next, we define $d/s = (m/s) - p$ as the relative discrepancy, for it is the difference between the probability of success and the relative frequency of success.
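The three results just derived admit a direct numerical check. A sketch for an assumed case $s = 20$, $p = 0.3$ (the particular numbers are ours):

```python
from math import comb

# Check the expectation ps, the variance spq, and the modal bounds
# ps - q <= m' <= ps + p, computed from the terms of expansion (1).
s, p = 20, 0.3
q = 1 - p
terms = [comb(s, m) * p ** m * q ** (s - m) for m in range(s + 1)]

mean = sum(m * t for m, t in enumerate(terms))
var = sum((m - mean) ** 2 * t for m, t in enumerate(terms))
mode = max(range(s + 1), key=lambda m: terms[m])

assert abs(mean - p * s) < 1e-9          # m-bar = ps = 6
assert abs(var - s * p * q) < 1e-9       # sigma**2 = spq = 4.2
assert p * s - q <= mode <= p * s + p    # here m' = 6
print(mode)
```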
The mean square of the relative discrepancy is the second member of equation (4) divided by $s^2$. It is clearly equal to the mean square $\sigma^2$ of the discrepancy divided by $s^2$, which gives

$$(7)\qquad \frac{\sigma^2}{s^2} = \frac{pq}{s}\,.$$

The theoretical value of the standard deviation of the relative frequency of successes is then $(pq/s)^{1/2}$.

11. Theorem of Bernoulli. The theorem of Bernoulli deals with the fundamental problem of the approach of the relative frequency $m/s$ of success in $s$ trials to the underlying constant probability $p$ as $s$ increases. The theorem may be stated as follows:

In a set of $s$ trials in which the chance of a success in each trial is a constant $p$, the probability $P$ of the relative discrepancy $(m/s) - p$ being numerically as large as any assigned positive number $\epsilon$ will approach zero as a limit as the number of trials $s$ increases indefinitely, and the probability $Q = 1 - P$ of this relative discrepancy being less than $\epsilon$ approaches 1, or certainty.

This theorem is sometimes called the law of large numbers. The theorem has been very commonly regarded as the basic theorem of mathematical statistics. But with the definition of probability (p. 8) as the limit of the relative frequency, this theorem is an immediate consequence of the definition. While it adds to the definition something about the manner of approach to the limit, the theorem is in some respects not so strong as the corresponding assumption in the definition. With a definition of probability other than the limit definition, the theorem may not follow so readily. It has been regarded as fundamental because of its bearing on the use of the relative frequency $m/s$ ($s$ large) as if it were a close approximation to the probability $p$.
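The behavior asserted by the theorem may be watched in a small illustrative simulation (ours, assuming $p = 1/2$): the relative frequency $m/s$ drifts toward $p$ as $s$ grows, its dispersion shrinking like $(pq/s)^{1/2}$.

```python
import random

random.seed(2)

# Bernoulli's theorem in miniature: the relative frequency of success
# in s trials approaches the underlying probability p = 1/2 as s grows.
def relative_frequency(s, p=0.5):
    return sum(random.random() < p for _ in range(s)) / s

freqs = {s: relative_frequency(s) for s in (100, 10_000, 100_000)}
for s, f in freqs.items():
    print(s, f, abs(f - 0.5))
```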
Assuming for the present that we have any definition of the probability $p$ of success in one trial from which we reach the law of repeated trials given in the binomial expansion (1), we may prove the Bernoulli theorem by the use of the Bienaymé–Tchebycheff criterion. To derive this criterion, consider a statistical variable $x$ which takes mutually exclusive values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, respectively, where $p_1 + p_2 + \cdots + p_n = 1$.

Let $a$ be any given number from which we wish to measure deviations of the $x$'s. A specially important case is that in which $a$ is a mean or expected value of $x$, although $a$ need not be thus restricted. For the expected mean-square deviation from $a$, we may write

$$\sigma^2 = p_1 d_1^2 + p_2 d_2^2 + \cdots + p_n d_n^2,$$

where $d_t = x_t - a$. Let $d', d'', \ldots$ be those deviations $x_t - a$ which are at least as large numerically as an assigned multiple $\epsilon = \lambda\sigma\ (\lambda > 1)$ of the root-mean-square deviation $\sigma$ from $a$, and let $p', p'', \ldots$ be the corresponding probabilities. Then we have

$$(8)\qquad \sigma^2 \ge p'd'^2 + p''d''^2 + \cdots.$$

Since $d', d'', \ldots$ are each numerically equal to or greater than $\lambda\sigma$, we have from (8) that

$$\sigma^2 \ge \lambda^2\sigma^2\,(p' + p'' + \cdots)\,.$$

If we now let $P(\lambda\sigma)$ be the probability that a value of $x$ taken at random from the "population" will differ from $a$ numerically by as much as $\lambda\sigma$, then $P(\lambda\sigma) = p' + p'' + \cdots$, and $\sigma^2 \ge \lambda^2\sigma^2 P(\lambda\sigma)$. Hence

$$(9)\qquad P(\lambda\sigma) \le \frac{1}{\lambda^2}\,.$$

To illustrate numerically, we may take $a$ to be the arithmetic mean of the $x$'s and say that the probability is not more than 1/25 that a variate taken at random will deviate from the arithmetic mean as much as five times the standard deviation. A striking property of the Bienaymé–Tchebycheff criterion is its independence of the nature of the distribution of the given values.
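Inequality (9) can be checked against any distribution whatever; a sketch taking the binomial terms of (1) as the distribution (the particular $s$ and $p$ are assumed for illustration):

```python
from math import comb

# Check inequality (9): P(|m - sp| >= lam * sigma) never exceeds 1/lam**2.
s, p = 50, 0.2
q = 1 - p
sigma = (s * p * q) ** 0.5
terms = [comb(s, m) * p ** m * q ** (s - m) for m in range(s + 1)]

def tail_prob(lam):
    return sum(t for m, t in enumerate(terms) if abs(m - s * p) >= lam * sigma)

for lam in (2, 3, 5):
    assert tail_prob(lam) <= 1 / lam ** 2
    print(lam, round(tail_prob(lam), 4), 1 / lam ** 2)
```

The actual tail probabilities fall well inside the bound, which is what the remarked generality of the criterion leads one to expect: holding for every distribution, it cannot be sharp for any particular one.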
In a slightly different form, we may state that the probability is greater than $1 - 1/\lambda^2$ that a variate taken at random will deviate less than $\lambda\sigma$ from the mathematical expectation. This theorem is ordinarily known as the inequality of Tchebycheff, but the main ideas underlying the inequality were also developed by Bienaymé.

We shall now turn our attention more directly to the theorem of Bernoulli. We seek the probability that the relative discrepancy $(m/s) - p$ will be numerically as large as an assigned positive number $\epsilon$. We may take $\epsilon = \lambda(pq/s)^{1/2}$, a multiple of the theoretical standard deviation $(pq/s)^{1/2}$ of the relative frequencies $m/s$ (see §10). Let $P$ be the probability that

$$\left|\frac{m}{s} - p\right| \ge \epsilon\,;$$

then from the Bienaymé–Tchebycheff criterion (9) we have $P \le 1/\lambda^2$. Since

$$\frac{1}{\lambda} = \frac{1}{\epsilon}\left(\frac{pq}{s}\right)^{1/2},\qquad\text{we have}\qquad P \le \frac{pq}{s\epsilon^2}\,.$$

For any assigned $\epsilon$, we may by increasing $s$ make $P$ small at pleasure. That is, the probability $P$ that the relative frequency $m/s$ will differ from the probability $p$ by at least as much as an assigned number, however small, tends toward zero as the number of cases $s$ is indefinitely increased. For example, if we are concerned with the probability $P$ that $|(m/s) - p| \ge .001$, we see that

$$P \le \frac{1{,}000{,}000\,pq}{s}\,.$$

If the number of trials $s$ is not very large, this inequality would ordinarily put no important restriction on $P$. But as $s$ increases indefinitely, $1{,}000{,}000\,pq$ remains constant, and $1{,}000{,}000\,pq/s$ approaches zero. Again, the probability $Q = 1 - P$ that $|(m/s) - p|$ is less than $\epsilon$ satisfies the condition

$$(10)\qquad Q \ge 1 - \frac{pq}{s\epsilon^2}\,.$$

From (10) we see that with any constant $pq/\epsilon^2$, the probability $Q$ becomes arbitrarily near 1, or certainty, as $s$ increases indefinitely. Hence the theorem is established for any definition of probability from which we derive (1) as the law of repeated trials.
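The numerical bound just obtained is easily tabulated. A sketch assuming $p = q = 1/2$, so that the bound becomes $250{,}000/s$ for $\epsilon = .001$:

```python
# The bound P <= pq/(s * eps**2) of section 11, with eps = .001 and
# p = q = 1/2: the restriction is vacuous for moderate s, and only
# becomes effective as s runs well beyond a million trials.
p = q = 0.5
eps = 0.001

def bernoulli_bound(s):
    return p * q / (s * eps ** 2)

for s in (10 ** 6, 10 ** 7, 10 ** 8):
    print(s, round(bernoulli_bound(s), 6))    # 0.25, 0.025, 0.0025
```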
It seems that the statement of the theorem concerning the probable approach of relative frequencies to the underlying probability may appear simpler and more elegant by the use of the concept of asymptotic certainty introduced by E. L. Dodd in a recent paper. According to this concept, we may say it is asymptotically certain that $m/s$ will approach $p$ as a limit as $s$ increases indefinitely.

12. The De Moivre–Laplace theorem. The De Moivre–Laplace theorem deals with the probability that the number of successes $m$ in a set of $s$ trials will fall within a certain conveniently assigned discrepancy $d$ from the mathematical expectation $sp$. By the inequality of Tchebycheff (p. 30), a lower limit to the value of this probability has been given. We shall now proceed to consider the problem of finding at least the approximate value of the probability. This problem would, in the simplest cases, involve merely the evaluation and addition of certain terms of the expansion (1). But this procedure would, in general, be impracticable when $s$ is large and $d$ even fairly large. To visualize the problem, we represent the terms of (1) by ordinates $y_x$ at unit intervals, where $x$ marks deviations of $m$ from the mathematical expectation $ps$ of successes as an origin. Then we have

$$(11)\qquad y_x = \frac{s!}{(sp + x)!\,(sq - x)!}\,p^{sp+x}\,q^{sq-x}\,.$$

The probability that the number of successes will lie within the interval $ps - d$ and $ps + d$, inclusive of end values, is then the sum of the ordinates

$$(12)\qquad y_{-d} + y_{-(d-1)} + \cdots + y_0 + y_1 + \cdots + y_d = \sum_{x=-d}^{+d} y_x\,.$$

As the number of $y$'s in this sum is likely to be large, some convenient method of finding the approximate value of the sum will be found useful. In attacking this problem, we shall first of all replace the factorials in (11) approximately by the first term of Stirling's formula for the representation of large factorials. This formula states that

$$(13)\qquad n! = n^n e^{-n} (2\pi n)^{1/2}\left(1 + \frac{1}{12n} + \frac{1}{288n^2} + \cdots\right).$$
To form an idea of the degree of approximation obtained by using only the first term of this formula, we may say that in replacing $n!$ by $n^n e^{-n}(2\pi n)^{1/2}$ we obtain a result equal to the true value divided by a number between 1 and $1 + 1/(10n)$. The use of this first term is thus a sufficiently close approximation for many purposes if $n$ is fairly large.

The substitution by the use of Stirling's formula for the factorials in (11) gives, after some algebraic simplification,

$$(14)\qquad y_x = \frac{1}{(2\pi spq)^{1/2}}\left(1 + \frac{x}{sp}\right)^{-sp-x-1/2}\left(1 - \frac{x}{sq}\right)^{-sq+x-1/2},$$

approximately. To explain further our conditions of approximation to (11), we naturally compare any individual discrepancy $x$ from the mathematical expectation $ps$ with the standard deviation $\sigma = (spq)^{1/2}$. We should note in this connection that $\sigma$ is of order $s^{1/2}$ if neither $p$ nor $q$ is extremely small. This fact suggests the propriety of assuming that $s$ is so large that $x/s$ shall remain negligibly small, but that $x/s^{1/2}$ may take finite values such as interest us most when we are making comparisons of a discrepancy with the standard deviation. It is important to bear in mind that we are for the present dealing with a particular kind of approximation.

Under the prescribed conditions of approximation, we shall now examine (14) with a view to obtaining a more convenient form for $y_x$. For this purpose, we may write

$$(15)\qquad \left(1 + \frac{x}{sp}\right)^{-sp-x-1/2} = e^{-x - \frac{x^2}{2sp} + \frac{x}{s}\phi(x)},$$

$$(16)\qquad \left(1 - \frac{x}{sq}\right)^{-sq+x-1/2} = e^{x - \frac{x^2}{2sq} + \frac{x}{s}\phi_1(x)},$$

where $\phi(x)$ and $\phi_1(x)$ are finite because each of them represents the sum of a convergent power series when $x/s$ is small at pleasure. From (14), (15), and (16),

$$y_x = \frac{1}{(2\pi spq)^{1/2}}\,e^{-\frac{x^2}{2spq} + \frac{x}{s}\phi_3(x)},$$

where $\phi_3(x)$ is clearly finite. Now if $s$ is so large that $(x/s)\phi_3(x)$ becomes small, we have

$$y_x \doteq \frac{1}{(2\pi spq)^{1/2}}\,e^{-\frac{x^2}{2spq}}$$

as an approximation to $y_x$ in (11). As a first approximation to the sum of the ordinates in (12), we then write the integral

$$(17)\qquad \frac{1}{(2\pi spq)^{1/2}}\int_{-d}^{+d} e^{-\frac{x^2}{2spq}}\,dx\,.$$

This integral is commonly known as the probability integral.
The ordinates of the bell-shaped curve (Fig. 3) represent the values of the function

$$y = \frac{1}{(2\pi)^{1/2}\,\sigma}\,e^{-\frac{x^2}{2\sigma^2}},$$

where $\sigma^2 = spq$. This curve is the normal frequency curve and will be further considered in Chapter III.

We may increase slightly the accuracy of our approximation by taking account of the fact that we have one more ordinate in (12) than intervals of area. We may therefore appropriately add an ordinate at $x = d$ to the value given in (17), and obtain

$$(18)\qquad \frac{1}{(2\pi spq)^{1/2}}\int_{-d}^{+d} e^{-\frac{x^2}{2spq}}\,dx + \frac{1}{(2\pi spq)^{1/2}}\,e^{-\frac{d^2}{2spq}}$$

for the probability that the discrepancy is between $-d$ and $d$, inclusive of end points. Another method of taking account of the extra ordinate is to extend the limits of integration in (17) by one-half the unit at both the upper and lower limits. That is, we write

$$(19)\qquad \frac{1}{(2\pi spq)^{1/2}}\int_{-d-1/2}^{+d+1/2} e^{-\frac{x^2}{2spq}}\,dx$$

in place of (17).

We may now state the De Moivre–Laplace theorem: Given a constant probability $p$ of success in each of $s$ trials, where $s$ is a large number, the probability that the discrepancy $m - sp$ of the number $m$ of successes from the mathematical expectation will not exceed numerically a given positive number $d$ is given to a first approximation by (17), and to closer approximations by (18) and (19).

Although formulas (17), (18), and (19) assume $s$ large, it is interesting to experiment by applying these formulas to cases in which $s$ is not large. For example, consider the problem of tossing six coins. The most probable number of heads is 3, and the probability of a discrepancy equal to or less than 1 is given exactly by

$$\left(\frac{6!}{3!\,3!} + \frac{6!}{4!\,2!} + \frac{6!}{2!\,4!}\right)\frac{1}{64} = \frac{25}{32},$$

which is the sum of the probabilities that the number of heads will be 2, 3, or 4 for $s = 6$ coins. But $spq = 1.5$, and $(spq)^{1/2} = 1.225$. Then, using $-3/2$ to $3/2$ as limits of integration in (19), we obtain from a table of the probability integral the approximate value .779, to compare with the exact value $25/32 = .781$.
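The six-coin experiment above can be reproduced without a table of the probability integral, evaluating (19) through the error function (a standard identity; the variable names are ours):

```python
from math import comb, erf, sqrt

# The six-coin check: exact binomial probability of 2, 3, or 4 heads
# versus the continuity-corrected normal integral (19).
s, p = 6, 0.5
q = 1 - p
exact = sum(comb(s, m) for m in (2, 3, 4)) / 2 ** s    # 25/32 = 0.78125

sigma = sqrt(s * p * q)                                # sqrt(1.5), about 1.225
# Formula (19): integrate the normal density from -3/2 to +3/2, that is,
# d = 1 extended by half a unit at each limit.
approx = erf(1.5 / (sigma * sqrt(2)))

print(round(exact, 3), round(approx, 3))               # 0.781 0.779
```

Even for $s$ as small as 6, the approximation reproduces the exact value to two decimal places, as the text observes.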
For certain purposes, there is an advantage in changing the variable $x$ to $t$ in (17) and (18) by the transformation

$$t = \frac{x}{(spq)^{1/2}},\qquad \delta = \frac{d}{(spq)^{1/2}}\,.$$

Then in place of (17) we have

$$(20)\qquad P_\delta = \frac{1}{(2\pi)^{1/2}}\int_{-\delta}^{+\delta} e^{-t^2/2}\,dt\,,$$

and in place of (18) we have

$$(21)\qquad P_\delta = \frac{1}{(2\pi)^{1/2}}\int_{-\delta}^{+\delta} e^{-t^2/2}\,dt + \frac{1}{(2\pi spq)^{1/2}}\,e^{-\delta^2/2}\,.$$

To give a general notion of the magnitude of the probabilities, we shall now list a few values of $P_\delta$ in (20) corresponding to assigned values of $\delta$. Thus,

δ....   .6745   1       2       3       4
P_δ..   .5      .68269  .95450  .99730  .99994

Extensive tables giving values of the probability integral and of the ordinates of the probability curve are readily available. For example, the Glover Tables of Applied Mathematics give $P_\delta/2$ for the argument $\delta = x/\sigma$; Sheppard's table gives $(1 + P_\delta)/2$ for the argument $\delta = x/\sigma$.

We may now state the De Moivre–Laplace theorem in another form by saying that the values of $P_\delta$ in (20) and (21) give approximations to the probability that

$$|m - sp| < \delta(spq)^{1/2}$$

for an assigned positive value of $\delta$. In still another slightly different form involving relative frequencies, we may state that the values of $P_\delta$ in (20) and (21) give approximations to the probability that the absolute value of the relative discrepancy satisfies the inequality

$$(22)\qquad \left|\frac{m}{s} - p\right| < \delta\left(\frac{pq}{s}\right)^{1/2}$$

for every assigned positive value of $\delta$.

In order to gain a fuller insight into the significance of the De Moivre–Laplace theorem, we may draw the following conclusions from (20): (a) Assuming, as is suggested by (20), that a $\delta$ exists corresponding to every assigned probability $P_\delta$, we find from $d = \delta(spq)^{1/2}$ that the bounds $-d$ to $+d$ increase in proportion to $s^{1/2}$ as $s$ is increased. (b) From (20) and (22) it follows that for assigned probabilities $P_\delta$ the bounds of discrepancy of the relative frequency $m/s$ from $p$ vary inversely as $s^{1/2}$.

To illustrate the use of the De Moivre–Laplace theorem, we take an example from the third edition of the American Men of Science by Cattell and Brimhall (p. 804).
A group of scientific men reported 1,705 sons and 1,527 daughters. The examination of these numbers brings up the following fundamental questions of simple sampling. Do these data conform to the hypothesis that 1/2 is the probability that a child to be born will be a boy? That is, can the deviations be reasonably regarded as fluctuations in simple sampling under this hypothesis? In another form, what is the probability in throwing 3,232 coins that the number of heads will differ from $3{,}232/2 = 1{,}616$ by as much as, or more than, $1{,}705 - 1{,}616 = 89$? In this case, $s = 3{,}232$, $(pqs)^{1/2} = 28.425$, $d = 1{,}705 - 1{,}616 = 89$, and

$$\delta = \frac{d}{(pqs)^{1/2}} = 3.131\,.$$

Referring to a table of the normal probability integral, we find from (20) that $P = .9983$. Hence, the probability that we will obtain a deviation of more than 89 on either side of 1,616 in a single trial is approximately $1 - .9983 = .0017$.

13. The quartile deviation. The discrepancy $d$ which corresponds to the probability $P = 1/2$ in (20) is sometimes called the quartile deviation, or the probable error, of $m$ as an approximation to $sp$. By the use of a table of the probability integral, it is found from (20) that

$$d = .6745\,(spq)^{1/2},\ \text{approximately},$$

when $P = 1/2$, and thus $.6745(spq)^{1/2}$ is the quartile deviation of the number of successes from the expectation $sp$.

14. The law of small probabilities. The Poisson exponential function. The De Moivre–Laplace theorem does not ordinarily give a good approximation to the terms of the binomial $(p + q)^s$ if $p$ or $q$ is small. If $s$ is large but $sp$ or $sq$ is small in relation to $s$, we may give a useful representation of the terms of the binomial expansion $(p + q)^s$ by means of the Poisson exponential function. Statistical examples of this situation are what may be called rare events, and may easily be given: the number born blind per year in a city of 100,000, or the number dying per year of a minor disease.
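The closeness of the Poisson representation to the binomial terms for such rare events may be seen numerically. A sketch assuming, for illustration, $s = 10{,}000$ trials with $p = .0003$, so that $\lambda = sp = 3$:

```python
from math import comb, exp, factorial

# Rare events: with s large and sp small, the binomial term
# C(s, m) p^m q^(s-m) is close to lam^m e^(-lam) / m!, lam = sp.
s, p = 10_000, 0.0003
lam = s * p                        # lam = 3

def binomial_term(m):
    return comb(s, m) * p ** m * (1 - p) ** (s - m)

def poisson_term(m):
    return lam ** m * exp(-lam) / factorial(m)

for m in range(6):
    print(m, round(binomial_term(m), 5), round(poisson_term(m), 5))
```

The two columns agree to about four decimal places, although the De Moivre–Laplace approximation would serve poorly here, $p$ being so small.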
Poisson¹⁰ had already as early as 1837 given the function involved in the treatment of the problem. Bortkiewicz¹¹ took up the problem in connection with a long series of observations of events which occur rarely. For example, one well-known series he gave was the frequency distribution of the number of men killed per army corps per year in the Prussian army from the kicks of horses. The frequency distribution of the number of deaths per army corps per year was:

Deaths . . . .      0     1     2     3     4
Frequency . . .   109    65    22     3     1

He called the law of frequency involved the "law of small numbers," and this name continues to be used although it does not seem very appropriate. The expression "law of small probabilities" seems to give a more accurate description.

Assume, then, that the probability p is small and that q = 1−p is nearly unity. That is, p is the probability of the occurrence of the rare event in question in a single trial. We then seek a convenient expression approximately equal to

\frac{s!}{m!\,n!}\, p^m q^n ,

the probability of m occurrences and n non-occurrences in m+n = s trials. Replacing s! and n! by means of Stirling's formula we obtain

\frac{(sp)^m e^{-m}\, q^n}{(1-m/s)^{s+1/2}\, m!} .

With large values of s and relatively small values of m, (1−m/s)^{s+1/2} differs relatively little from (1−m/s)^s, and this in turn differs relatively little from e^{−m}. Furthermore, q^n = (1−p)^n differs very little from e^{−np}, since n log(1−p) = −np − np²/2 − ⋯ differs from −np only by terms which are negligible when p is small. Introducing these approximations by substituting e^{−m} for (1−m/s)^{s+1/2}, and e^{−np} for q^n, we have

\frac{(sp)^m e^{-np}}{m!} .

For rare events, of small probability p, np differs very little from sp = λ. Hence, we write

(23)   P_m = \frac{\lambda^m e^{-\lambda}}{m!}

for the approximate probability of m occurrences of the rare event. Then the terms of the series

e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots\right)

give the approximate probabilities of exactly 0, 1, 2, . . . .
, occurrences of the rare event in question, and the sum of the series

(24)   e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots + \frac{\lambda^w}{w!}\right)

gives the probability that the rare event will happen either 0, 1, 2, . . . . , or w times in s trials.

Although we have assumed in deriving the Poisson exponential function λ^m e^{−λ}/m! that m is small in comparison with s, we may obtain certain simple and interesting results for the mathematical expectation and standard deviation of the distribution given by the Poisson exponential when m takes all integral values from m = 0 to m = s. Thus, when w = s in (24), we clearly have

(25)   e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots + \frac{\lambda^s}{s!}\right) = 1

approximately if s is large. Since the successive terms in (25) give approximately the probabilities of 0, 1, 2, . . . . , s occurrences of the rare event, the mathematical expectation \sum m P_m of the number of such occurrences is

e^{-\lambda}\left(0 + \lambda + \frac{2\lambda^2}{2!} + \cdots + \frac{s\lambda^s}{s!}\right) = \lambda e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2!} + \cdots + \frac{\lambda^{s-1}}{(s-1)!}\right) = \lambda

approximately when s is large. Similarly, the second moment μ₂′ about the origin is

\mu_2' = e^{-\lambda}\left[\lambda + \frac{2^2\lambda^2}{2!} + \frac{3^2\lambda^3}{3!} + \cdots + \frac{s^2\lambda^s}{s!}\right] = \lambda + \lambda^2

approximately, and the second moment about the mathematical expectation is

(26)   \mu_2 = \mu_2' - \lambda^2 = \lambda + \lambda^2 - \lambda^2 = \lambda , \quad \text{nearly} = sp ,

an approximation to spq since q differs but little from 1.

Tables of the Poisson exponential limit e^{−λ}λ^x/x! are given in Tables for Statisticians and Biometricians (pp. 113–24), and in Biometrika, Volume 10 (1914), pages 25–35. The values of e^{−λ}λ^x/x! are tabulated to six places of decimals for λ varying from .1 to 15 by intervals of one-tenth and for x varying from 0 to 37.

A general notion of the values of the function for certain values of λ may be obtained from Figure 4, where the ordinates at 0, 1, 2, . . . . , show the values of the function for λ = .5, 1, 2, and 5.

Miss Whittaker has prepared special tables (Tables for Statisticians and Biometricians, pp.
122–24) which facilitate the comparison of results from the Poisson exponential with those from the De Moivre-Laplace theory in dealing with the sampling fluctuations of small frequencies.

The question naturally arises as to the value of p below which we should prefer to use the Poisson exponential, in place of the results of the De Moivre-Laplace theory, in dealing with the probability of a discrepancy less than an assigned number. While there is no exact answer to this question, there seems to be good reason for certain purposes in restricting the application of the De Moivre-Laplace results to cases where the probability is perhaps not less than p = .03.

To illustrate by a concrete situation in which p is small, consider a case of 6 observed deaths from pneumonia in an exposure of 10,000 lives of a well-defined class aged 30 to 31. It is fairly obvious, on the one hand, that the possible variations below 6 are restricted to 6, whereas there is no corresponding restriction above 6. On the other hand, if we take (6/10,000) = 3/5,000 as the probability of death from pneumonia within a year of a person aged 30, it is more likely that we shall experience 5 deaths than 7 deaths among the 10,000 exposed; for the probability

\binom{10{,}000}{5}\left(\frac{4{,}997}{5{,}000}\right)^{9{,}995}\left(\frac{3}{5{,}000}\right)^{5}

of 5 deaths is greater than the probability

\binom{10{,}000}{7}\left(\frac{4{,}997}{5{,}000}\right)^{9{,}993}\left(\frac{3}{5{,}000}\right)^{7}

of 7 deaths.

[Fig. 4: ordinates of e^{−λ}λ^x/x! at x = 0, 1, 2, . . . . for λ = .5, 1, 2, and 5.]

Suppose we now set the problem of finding the probability that upon repetition with another sample of 10,000, the deviation from 6 deaths on either side will not exceed 3. The value to three significant figures calculated from the binomial expansion is .854. To use the De Moivre-Laplace theorem, we simply make d = 3 in (19), and obtain from tables of probability functions the value P₃ = .847.
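The two Poisson applications of this section can be reproduced directly. A sketch in Python, using Bortkiewicz's horse-kick data and the pneumonia example quoted above (the column of expected frequencies is an addition of this sketch, computed from the fitted λ, not a figure from the text):

```python
import math

def poisson(m, lam):
    # the Poisson exponential (23): lam^m e^{-lam} / m!
    return math.exp(-lam) * lam**m / math.factorial(m)

# Bortkiewicz's horse-kick data: deaths per army corps per year.
deaths, freq = [0, 1, 2, 3, 4], [109, 65, 22, 3, 1]
total = sum(freq)                                       # 200 corps-years
lam = sum(d * f for d, f in zip(deaths, freq)) / total  # arithmetic mean, 0.61
expected = [total * poisson(m, lam) for m in deaths]    # fitted frequencies

# The pneumonia example: s = 10,000 exposed, p = 3/5,000, expectation 6.
s, p = 10_000, 3 / 5_000
q = 1 - p
# probability that the number of deaths deviates from 6 by at most 3,
# first from the binomial expansion, then from the Poisson exponential
binom = sum(math.comb(s, m) * p**m * q**(s - m) for m in range(3, 10))
pois = sum(poisson(m, s * p) for m in range(3, 10))
print([round(e, 1) for e in expected], round(binom, 3), round(pois, 5))
```

The Poisson sum reproduces the 100 − 14.589 = 85.411 per cent of the tables cited below, and the binomial sum the .854 of the text.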
We should thus expect from the De Moivre-Laplace theorem a discrepancy either in defect more than 3 or in excess more than 3 in 100 − 84.7 = 15.3 per cent of the cases, and from the sum of the binomial terms we should expect such a discrepancy in 100 − 85.4 = 14.6 per cent of the cases.

Turning next to tables of the Poisson exponential, page 122 of Tables for Statisticians and Biometricians, we find that in 6.197 per cent of cases there will be a discrepancy in defect more than 3 and in 8.392 per cent of cases there will be a discrepancy in excess more than 3. The sum of 6.197 and 8.392 per cent is 14.589 per cent. This result differs very little for purposes of dealing with sampling errors from the 15.3 per cent given by the De Moivre-Laplace formula, but it is a closer approximation to the correct value and has the advantage of showing separately the percentage of cases in excess more than the assigned amount and the percentage in defect more than the same amount.

CHAPTER III

FREQUENCY FUNCTIONS OF ONE VARIABLE

15. Introduction. In Chapter I we have discussed very briefly three different methods of describing frequency distributions of one variable — the purely graphic method, the method of averages and measures of dispersion, and the method of theoretical frequency functions or curves. The weakness and inadequacy of the purely graphic method lies in the fact that it fails to give a numerical description of the distribution.
While the method of averages and measures of dispersion gives a numerical description in the form of a summary characterization which is likely to be useful for many statistical purposes, particularly for purposes of comparison, the method is inadequate for some purposes because (1) it does not give a characterization of the distribution in the neighborhood of each point x or in each small interval x to x+dx of the variable, (2) it does not give a functional relation between the values of the variable x and the corresponding frequencies.

To give a description of the distribution at each small interval x to x+dx and to give a functional relation between the variable x and the frequency or probability, we require a third method, which may be described as the "analytical method of describing frequency distributions." This method uses theoretical frequency functions. That is, in this method of description we attempt to characterize the given observed frequency distribution by appealing to underlying probabilities, and we seek a frequency function y = F(x) such that F(x)dx gives to within infinitesimals of higher order the probability that a variate x′ taken at random falls in the interval x to x+dx.

Although the great bulk of frequency distributions which occur so abundantly in practical statistics have certain important properties in common, nevertheless they vary sufficiently to present difficult problems in considering the properties of F(x) which should be regarded as fundamental in the selection of an appropriate function to fit a given observed distribution.

The most prominent frequency function of practical statistics is the normal or so-called Gaussian function

(1)   y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2} ,

where σ is the standard deviation (see Fig. 3, p. 35).
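That (1) is a proper frequency function, with unit area and with σ as the radius of gyration of that area, can be verified by a crude quadrature; a sketch in Python (the value σ = 1.7 is an arbitrary choice for illustration):

```python
import math

def normal(x, sigma):
    # the normal function (1)
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

sigma = 1.7
h, lo, hi = 0.001, -15.0, 15.0            # midpoint rule over a wide interval
n = int((hi - lo) / h)
xs = [lo + (i + 0.5) * h for i in range(n)]
area = sum(normal(x, sigma) * h for x in xs)            # total area under the curve
second = sum(x * x * normal(x, sigma) * h for x in xs)  # second moment of area
# area is 1; second equals sigma^2, i.e. sigma is the radius of gyration
```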
Although Gauss made such noteworthy contributions to error theory by the use of this function that his name is very commonly attached to the function, and to the corresponding curve, it is well known that Laplace made use of the exponential frequency function prior to Gauss by at least thirty years. It would thus appear that the name of Laplace might more appropriately be attached to the function than that of Gauss. But in a recent and very interesting historical note, Karl Pearson finds that De Moivre as early as 1733 gave a treatment of the probability integral and of the normal frequency function. The work of De Moivre antedates the discussion of Laplace by nearly a half-century. Moreover, De Moivre's treatment is essentially our modern treatment. Hence it appears that the discovery of the normal function should be attributed to De Moivre, and that his name might be most appropriately attached to the function.

It may well be recalled that we obtained this function (1) in the De Moivre-Laplace theory (p. 34). In (1) the origin is taken so that the x-co-ordinate of the centroid of area under the curve is zero. The approximate value of the centroid may be obtained from a large number of observed variates by finding their arithmetic mean. The σ is equal to the radius of gyration of the area under the curve with respect to the y-axis, and is obtained approximately from observed variates by finding their standard deviation.

The probability or frequency function (1) has been derived from a great variety of hypotheses.¹² The difficulty is not one of deriving this function but rather one of establishing a high degree of probability that the hypotheses underlying the derivation are realized in relation to practical problems of statistics.
In the decade from 1890 to 1900, it became well established experimentally that the normal probability function is inadequate to represent many frequency distributions which arise in biological data. To meet the situation it was clearly desirable either to devise methods for characterizing the most conspicuous departures from the normal distributions or to develop generalized frequency curves. The description and characterization of these departures without the direct use of generalized frequency curves has been accomplished roughly by the introduction (see pp. 68–72) of measures of skewness and of peakedness (excess or kurtosis), but the rationale underlying such measures is surely to be sought most naturally in the properties of generalized frequency functions. In spite of the reasons which may thus be advanced for the study of generalized frequency curves, it is fairly obvious that, for the most part, the authors of the rather large number of recent elementary textbooks on the methods of statistical analysis seem to regard it as undesirable or impracticable to include in such books the theory of generalized frequency curves. The writer is inclined to agree with these authors in the view that the complications of a theory of generalized frequency curves would perhaps have carried them too far from their main purposes. Nevertheless, some results of this theory are important for elementary statistics in providing a set of norms for the description of actual frequency distributions. In order to avoid misunderstanding it should perhaps be said that it is not intended to imply that a formal mathematical representation of many numerical distributions is desirable, but rather that a certain amount of such representation of carefully selected distributions should be encouraged.
A useful purpose will be served in this connection if we can make certain points of interest in the theory more accessible by means of the present monograph.

The problem of developing generalized frequency curves has been attacked from several different directions. Gram (1879), Thiele (1889), and Charlier (1905) in Scandinavian countries; Pearson (1895) and Edgeworth (1896) in England; and Fechner (1897) and Bruns (1897) in Germany have developed theories of generalized frequency curves from viewpoints which give very different degrees of prominence to the normal probability curve in the development of a more general theory. In the present monograph, special attention will be given to two systems of frequency curves — the Pearson system and the Gram-Charlier system.

16. The Pearson system of generalized frequency curves. Pearson's first memoir¹³ dealing with generalized frequency curves appeared in 1895. In this paper he gave four types of frequency curves in addition to the normal curve, with three subtypes under his Type I and two subtypes under his Type III. He published a supplementary memoir¹⁴ in 1901 which presented two further types. A second supplementary memoir¹⁵ which was published in 1916 gave five additional types. Pearson's curves, which are widely different in general appearance, are so well known and so accessible that we shall take no time to comment on them as graduation curves for a great variety of frequency distributions, but we shall attempt to indicate the genesis of the curves with special reference to the methods by which they are grounded on or associated with underlying probabilities.

We shall consider a frequency function y = F(x) of one variable where we assume that F(x)dx differs at most by an infinitesimal of higher order from the probability that a variate x taken at random will fall into the interval x to x+dx.
Pearson's types of curves y = F(x) are obtained by integration of the differential equation

(2)   \frac{dy}{dx} = \frac{(x+a)y}{c_0 + c_1 x + c_2 x^2} ,

and by giving attention to the interval on x in which y = F(x) is positive. The normal curve is given by the special case c₁ = c₂ = 0. We may easily obtain a clear view of the genesis of the system of Pearson curves in relation to laws of probability by following the early steps in the development of equation (2). The development is started by representing the probabilities of successes in n trials given by the terms of the symmetric point binomial (1/2 + 1/2)ⁿ as ordinates of a frequency polygon. It is then easily proved that the slope dy/dx of any side of this polygon, at its midpoint, takes the form

(3)   \frac{dy}{dx} = -k^2(x+a)y ,

where y is the ordinate at this point, and a and k are constants. By integration, we obtain the curve for which this differential equation is true at all points. The curve thus obtained is the normal curve (Pearson's Type VII).

The next step consists in dealing with the asymmetric point binomial (p+q)ⁿ, p ≠ q, in a manner analogous to that used in the case of the symmetric point binomial. This procedure gives the differential equation

\frac{dy}{dx} = \frac{(x+a)y}{c_0 + c_1 x} ,

from which we obtain by integration the Pearson Type III curve

(4)   y = y_0\left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x} .

That is, with respect to the slope property, this curve stands in the same relation to the values given by the asymmetric binomial polygon as the normal curve does to values given by the symmetric binomial.

Thus far the underlying probability of success has been assumed constant. The next step consists in taking up a probability problem in which the chance of success is not constant, but depends upon what has happened previously in a set of trials.
Thus, the chance of getting r white balls from a bag containing np white and nq black balls in drawing s balls one at a time without replacements is given by

(5)   \binom{s}{r}\frac{(np)_r\,(nq)_{s-r}}{(n)_s} = \frac{(np)!\,(nq)!\,(n-s)!\,s!}{(np-r)!\,(nq-s+r)!\,n!\,r!\,(s-r)!} ,

where (n)_s means the number of permutations of n things taken s at a time and \binom{s}{r} is the number of combinations of s things r at a time. This expression is a term of a hypergeometric series. By representing the terms of this series as ordinates of a frequency polygon, and finding the slope of a side of the frequency polygon, and proceeding in a manner analogous to that used in the case of the point binomial, we obtain a differential equation of the form given in (2). Thus, we make r = 0, 1, 2, . . . . , s and obtain the s+1 ordinates y₀, y₁, y₂, . . . . , y_s at unit intervals. At the middle point of the side joining the tops of ordinates y_r and y_{r+1}, we have

(6)   x = r + \tfrac{1}{2} , \qquad y = \tfrac{1}{2}(y_r + y_{r+1}) ,

and

(7)   \frac{dy}{dx} = y_{r+1} - y_r = y_r\left[\frac{(np-r)(s-r)}{(r+1)(r+1+nq-s)} - 1\right] = y_r\,\frac{s + nps - nq - 1 - r(n+2)}{(r+1)(r+1+nq-s)} .

From y = (y_r + y_{r+1})/2, we have

(8)   y = \frac{y_r}{2}\cdot\frac{nps + nq + 1 - s + r(nq + 2 - np - 2s) + 2r^2}{(r+1)(r+1+nq-s)} .

From (7) and (8), replacing r by x − 1/2, we have

(9)   \frac{1}{y}\frac{dy}{dx} = \frac{2s + 2nps - 2nq - 2 - (2x-1)(n+2)}{nps + nq + 1 - s + (x-\tfrac{1}{2})(nq + 2 - np - 2s) + 2(x-\tfrac{1}{2})^2} .

From (9), we observe that the slope of the frequency polygon, at the middle point of any side, divided by the ordinate at that point is equal to a fraction whose numerator is a linear function of x and whose denominator is a quadratic function of x. The differential equation (2) gives a general statement of this property. It is more general than (9) in that the constants of (9) are special values found from the law of probability involved in drawings from a limited supply without replacements.
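Equations (5)–(9) can be checked numerically: compute the hypergeometric ordinates for small illustrative values of np, nq, and s (the particular values below are arbitrary), and compare the slope-to-ordinate ratio of the polygon with the right-hand member of (9). A sketch in Python:

```python
from math import comb

# Hypergeometric ordinates y_r of (5): np white, nq black, s drawn.
np_, nq, s = 4, 6, 3
n = np_ + nq
y = [comb(np_, r) * comb(nq, s - r) / comb(n, s) for r in range(s + 1)]

for r in range(s):
    x = r + 0.5                                          # midpoint abscissa, eq. (6)
    lhs = (y[r + 1] - y[r]) / ((y[r] + y[r + 1]) / 2)    # slope divided by ordinate
    num = 2*s + 2*np_*s - 2*nq - 2 - (2*x - 1) * (n + 2)
    den = np_*s + nq + 1 - s + (x - 0.5) * (nq + 2 - np_ - 2*s) + 2 * (x - 0.5)**2
    assert abs(lhs - num / den) < 1e-12                  # eq. (9) holds on every side
```

The ratio is indeed a linear function of x over a quadratic function of x, which is the property that equation (2) generalizes.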
One of Pearson's generalizations therefore consists in admitting as frequency curves all those curves of which (2) is the differential equation, without the limitations on the values of the constants involved in (9). The questions involved in the integration of (2) and in the determination of parameters for actual distributions are so available in Elderton's Frequency Curves and Correlation, and elsewhere, that it seems undesirable to take the space necessary to deal with these questions here. The resulting types of equations, and figures that indicate the general form of the curves for certain positive values of the parameters, are listed below.

Type I (Fig. 5)

y = y_0\left(1+\frac{x}{a_1}\right)^{m_1}\left(1-\frac{x}{a_2}\right)^{m_2} , \quad \text{where } \frac{m_1}{a_1} = \frac{m_2}{a_2} .

Type II (Fig. 6)

y = y_0\left(1-\frac{x^2}{a^2}\right)^{m} .

Type III (Fig. 7)

y = y_0\left(1+\frac{x}{a}\right)^{\gamma a} e^{-\gamma x} .

Type IV

y = y_0\left(1+\frac{x^2}{a^2}\right)^{-m} e^{-\nu\tan^{-1}(x/a)} .

A skew curve of unlimited range at both ends, roughly described in general appearance as a slightly deformed normal curve (for the normal curve, see Fig. 3, p. 35).

Type V (Fig. 8)

y = y_0\,x^{-p}\,e^{-\gamma/x} .

Type VI (Fig. 9)

y = y_0\,(x-a)^{q_2}\,x^{-q_1} .

Type VII (Fig. 3, p. 35)

y = y_0\,e^{-x^2/2\sigma^2} .

The normal frequency curve.

Type VIII (Fig. 10)

y = y_0\left(1+\frac{x}{a}\right)^{-m} .

This type degenerates into an equilateral hyperbola when m = 1.

Type IX (Fig. 11)

y = y_0\left(1+\frac{x}{a}\right)^{m} .

This type degenerates into a straight line when m = 1.

Type X (Fig. 12)

y = \frac{1}{2a}\,e^{\mp x/a} .

This type is Laplace's first frequency curve while the normal curve is sometimes called his second frequency curve.

Type XI (Fig. 13)

y = y_0\,x^{-m} .

Type XII (Fig. 14)

y = y_0\left(\frac{a_1+x}{a_2-x}\right)^{m} .

The above figures should be regarded as roughly illustrating only in a meager way, for particular positive values of the parameters, the variety of shapes that are assumed by the Pearson type curves.
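The slope property behind (2) is easy to verify for a particular type. For Type III in the form listed above (the parameter values below are arbitrary illustrations), the logarithmic derivative reduces to −γx/(a+x), a linear function of x over a linear function of x, which is (2) with c₂ = 0; a sketch in Python checking this against a numeric derivative:

```python
import math

# Type III: y = y0 * (1 + x/a)**(g*a) * exp(-g*x); differentiating the logarithm,
# (1/y) dy/dx = g*a/(a + x) - g = -g*x/(a + x).
y0, a, g = 1.0, 2.0, 0.7

def y(x):
    return y0 * (1 + x / a) ** (g * a) * math.exp(-g * x)

for x in (-1.0, 0.0, 0.5, 3.0):        # points inside the range x > -a
    h = 1e-6
    dlog = (math.log(y(x + h)) - math.log(y(x - h))) / (2 * h)  # numeric d(log y)/dx
    assert abs(dlog - (-g * x / (a + x))) < 1e-6
```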
For example, it is fairly obvious that Types I and II would be U-shaped when the exponents are negative, and that Type III would be J-shaped if γa were negative.

The idea of obtaining a suitable basis for frequency curves in the probabilities given by terms of a hypergeometric series is the main principle which supports the Pearson curves as probability or frequency curves, rather than as mere graduation curves. That is to say, these curves should have a wide range of applications as probability or frequency curves if the distribution of statistical material may be likened to distributions which arise under the law of probability represented by terms of a hypergeometric series, and if this law may be well expressed by determining a frequency function y = F(x) from the slope of the frequency polygon of the hypergeometric series. In examining the source of the Pearson curves, the fact should not be overlooked that the normal probability curve can be derived from hypotheses containing much broader implications than are involved in a slope condition of the side of a symmetric binomial polygon.

The method of moments plays an essential rôle in the Pearson system of frequency curves, not only in the determination of the parameters, but also in providing criteria for selecting the appropriate type of curve. Pearson has attempted to provide a set of curves such that some one of the set would agree with any observational or theoretical frequency curve of positive ordinates by having equal areas and equal first, second, third, and fourth moments of area about a centroidal axis. Let μ_m be the mth moment coefficient about a centroid vertical taken as the y-axis (cf. p. 19). That is, let

(10)   \mu_m = \int_{-\infty}^{\infty} x^m F(x)\,dx ,

where F(x) is the frequency function (see p. 13).
Next, let

\beta_1 = \frac{\mu_3^2}{\mu_2^3} \qquad \text{and} \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} .

Then it is Pearson's thesis that the conditions μ₀ = 1, μ₁ = 0, together with the equality of the numbers μ₂, β₁, and β₂ for the observed and theoretical curves, lead to equations whose solutions give such values to the parameters of the frequency function that we almost invariably obtain excellency of fit by using the appropriate one of the curves of his system to fit the data, and that badness of fit can be traced, in general, to heterogeneity of data, or to the difficulty in the determination of moments from the data as in the case of J- and U-shaped curves.

Let us next examine the nature of the criteria by which to pass judgment on the type of curve to use in any numerical case. Obviously, the form which the integral y = F(x) obtained from (2) takes depends on the nature of the zeros of the quadratic function in the denominator. An examination of the discriminant of this quadratic function leads to equalities and inequalities involving β₁ and β₂ which serve as criteria in the selection of the type of function to be used. A systematic procedure for applying these criteria has been thoroughly developed and published in convenient form in Pearson's Tables for Statisticians and Biometricians (1914), pages lx–lxx and 66–67; and in his paper in The Philosophical Transactions, A, Volume 216 (1916), pages 429–57. The relations between β₁ and β₂ may be conveniently represented by curves in the β₁β₂-plane. Then the normal curve corresponds to the point β₁ = 0, β₂ = 3 in this plane. Type III is to be chosen when the point (β₁, β₂) is on the line

2\beta_2 - 3\beta_1 - 6 = 0 ;

and Type V, when (β₁, β₂) is on the cubic

\beta_1(\beta_2+3)^2 = 4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6) .

In considering subtypes under Type I, a biquadratic in β₁ and β₂ separates the area of J-shaped modeless curves from the area of limited range modal curves and the area of U-shaped curves.
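The Type III criterion can be made concrete with the gamma density, which is the Type III curve in its usual modern dress (this identification and the shape value k = 5 are assumptions of the illustration, not statements of the text). Computing μ₂, μ₃, μ₄ by crude quadrature, the point (β₁, β₂) falls on the line 2β₂ − 3β₁ − 6 = 0:

```python
import math

def gamma_pdf(x, k):
    # Type III realized as the gamma density with shape k and unit scale
    return x ** (k - 1) * math.exp(-x) / math.gamma(k)

def central_moment(m, k, hi=100.0, n=100_000):
    # midpoint-rule moment about the mean (which is k for unit scale)
    h = hi / n
    return sum((x - k) ** m * gamma_pdf(x, k) * h
               for x in (h * (i + 0.5) for i in range(n)))

k = 5.0
mu2, mu3, mu4 = (central_moment(m, k) for m in (2, 3, 4))
beta1 = mu3 ** 2 / mu2 ** 3        # 4/k for the gamma, here 0.8
beta2 = mu4 / mu2 ** 2             # 3 + 6/k, here 4.2
# the point (beta1, beta2) lies on Pearson's Type III line
assert abs(2 * beta2 - 3 * beta1 - 6) < 1e-3
```

As k grows, (β₁, β₂) moves along this line toward the normal point (0, 3).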
Without going further into detail about criteria for the selection of the type of curve, we may summarize by saying that curves traced on the β₁β₂-plane provide the means of selecting the Pearson type of frequency curve appropriate to the given distribution in so far as the necessary conditions expressed by relations between β₁ and β₂ turn out to be sufficient to determine a suitable type of curve.

The difficulties involved in the numerical computation of the parameters of the Pearson curves were rather clearly indicated in Pearson's original papers. The appropriate tables and forms for computations in fitting the curves to numerical distributions have been so available in various books as to facilitate greatly the applications to concrete data. Among such books and tables, special mention should be made of Frequency Curves and Correlation (1906), by W. P. Elderton, pages 5–105; Tables for Statisticians and Biometricians (1924), by Karl Pearson; and Tables of Incomplete Gamma Functions (1921), by the same author.

17. Generalized normal curves — Gram-Charlier series. Suppose some simple frequency function such as the normal function or the Poisson exponential function (p. 41) gives a rough approximation to a given frequency distribution and that we desire a more accurate analytic representation than would be given by the simple frequency function. In this situation, it seems natural to seek an analytical representation by means of the first few terms of a rapidly convergent series of which the first term, called the "generating function," is the simple frequency function which gives the rough approximation.
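The sense in which further terms correct the generating function can be made concrete. A sketch in Python: take the normal generating function, attach the term a₃φ⁽³⁾(x) with a₃ = −α₃/3! (the value derived in § 19 below, here simply assumed), and verify by quadrature that the area, mean, and standard deviation are untouched while the third moment becomes α₃:

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def phi3(x):
    # third derivative of phi: phi'''(x) = -(x**3 - 3*x) * phi(x)
    return -(x ** 3 - 3 * x) * phi(x)

alpha3 = 0.5                       # a moderately skew illustration
a3 = -alpha3 / math.factorial(3)   # a_3 = -alpha_3 / 3!

def F(x):
    return phi(x) + a3 * phi3(x)   # two terms of the Type A series

h, lo, hi = 0.001, -12.0, 12.0
xs = [lo + (i + 0.5) * h for i in range(int((hi - lo) / h))]
m0 = sum(F(x) * h for x in xs)          # area: still 1
m1 = sum(x * F(x) * h for x in xs)      # mean: still 0
m2 = sum(x * x * F(x) * h for x in xs)  # variance: still 1
m3 = sum(x ** 3 * F(x) * h for x in xs) # third moment: now alpha_3
```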
Prominent among the contributors to the method of the representation of frequency by a series may be named Gram,¹⁶ Thiele,¹⁷ Edgeworth,¹⁸ Fechner,¹⁹ Bruns,²⁰ Charlier,²¹ and Romanovsky.²²

Our consideration of series for the representation of frequency will be limited almost entirely to the Gram-Charlier generalizations of the normal frequency function and of the Poisson exponential function, by using these functions as generating functions. These two types of series may be written in the following forms:

Type A

(11)   F(x) = a_0\phi(x) + a_3\phi^{(3)}(x) + \cdots + a_n\phi^{(n)}(x) + \cdots ,

where

\phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-b)^2/2\sigma^2}

and φ⁽ⁿ⁾(x) is the nth derivative of φ(x) with respect to x.

Type B

(12)   F(x) = c_0\psi(x) + c_1\Delta\psi(x) + \cdots + c_n\Delta^n\psi(x) + \cdots ,

where

\psi(x) = \frac{e^{-\lambda}\sin \pi x}{\pi}\left[\frac{1}{x} - \frac{\lambda}{x-1} + \frac{\lambda^2}{2!\,(x-2)} - \cdots\right] ,

which is the Poisson exponential e^{−λ}λ^x/x! for non-negative integral values of x, and where Δψ(x), Δ²ψ(x), . . . . , denote the successive finite differences of ψ(x) beginning with Δψ(x) = ψ(x) − ψ(x−1).

If Type A or Type B converges so rapidly that terms after the second or third may be neglected, it is fairly obvious that we have a simple analytic representation of the distribution.

The general appearance of the curves represented by two or three terms of Type A, for particular values of the coefficients, is shown in Figure 15 so as to facilitate comparison with the corresponding normal curve represented by the first term.

[Fig. 15: I. y = φ(x), the normal curve; II. a curve given by two terms of Type A; III. a curve given by three terms of Type A.]

A general notion of the values of the function represented by the first term of Type B may be obtained for particular values of λ from Figure 4, page 43. When λ is taken equal to the arithmetic mean of the number of occurrences of the rare event in question, we shall find that c₁ = 0. We may then well inquire into the general appearance of the graph of the function

\psi(x) + c_2\Delta^2\psi(x)

for particular values of c₂ and λ.
For λ = 2 and c₂ = −.4, see Figure 16, which shows also the corresponding ψ(x).

[Fig. 16: I. y = ψ(x); II. y = ψ(x) − .4Δ²ψ(x); λ = 2.]

It should probably be emphasized that the usefulness of a series representation of a given frequency distribution depends largely upon the rapidity of convergence. In turn the rapidity with which the series converges depends much upon the degree of approach of the generating function to the given distribution.

Although it is known²³ that the Type A series is capable of converging to an arbitrary function f(x) subject to certain conditions of continuity and vanishing at infinity, mere convergence is not sufficient for our problems. The representation of an actual frequency distribution requires, in general, such rapid convergence that only a few terms will be found necessary for the desired degree of approximation, because (1) the amount of labor in computation soon becomes impracticable as the number of terms increases, and (2) the probable errors of high-order moments involved in finding the parameters would generally be so large that the assumption that we may use moments of observations for the theoretical moments will become invalid.

18. Remarks on the genesis of the Type A and Type B forms. We naturally ask why a generalization of the normal frequency function should take the form of Type A rather than some other form, say the product of the generating function by a simple polynomial of low degree in x or by an ordinary power series in x. A similar question might be asked about the generalization of the Poisson exponential function. There seems to be no very simple answer to these questions. It is fair to say that algebraic and numerical convenience, as well as suggestions from underlying probability theory, have been significant factors in the selection of Type A and Type B.
The algebraic and numerical convenience of Type A becomes fairly obvious by following Gram in determining the parameters. The suggestion of these forms in probability theory is closely associated with the development of the hypothesis of elementary errors (deviations) as given by Charlier.²¹ A very readable discussion of the manner in which the Type A series arises in the probability theory of the distribution of a variate built up by the summation of a large number of independent elements is given in the recent book by Whittaker and Robinson on The Calculus of Observations, pages 168–74.

In the present monograph, we shall limit our discussion of the probability theory underlying Types A and B to showing in Chapter VII that a certain line of development of the binomial distribution suggests the use of the Type A series as an extension of the ordinary De Moivre-Laplace approximation, and the Type B series as an extension of the Poisson exponential approximation considered in Chapter II. This development is postponed to the final chapter of the book because it involves more formal mathematics than some readers may find it convenient to follow. Certain important results derived in Chapter VII are stated without proof in §§ 19–21. While a mastery of the details of Chapter VII is not essential to an understanding of the results given in §§ 19–21, the reader who can follow a formal mathematical development without special difficulty may well read Chapter VII at this point instead of reading §§ 19–21. In § 56 of Chapter VII we follow closely the recent work of Wicksell²⁴ in the development of the forms of the Type A and Type B series. Then in §§ 57–59 we deal with the principles involved in the determination of the parameters in these type forms.

19. The coefficients of the Type A series expressed in moments of the observed distribution.
If we measure x from the centroid of area as an origin and with units equal to the standard deviation, σ, we may write the Type A series in the form

(13)   F(x) = \phi(x) + a_3\phi^{(3)}(x) + a_4\phi^{(4)}(x) + \cdots + a_n\phi^{(n)}(x) + \cdots ,

where

\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}

and φ⁽ⁿ⁾(x) is the nth derivative of φ(x) with respect to x.

It will be shown in § 57 that the coefficients a_n for (n = 3, 4, . . . .) may then be expressed in the form

(14)   a_n = \frac{(-1)^n}{n!}\int_{-\infty}^{\infty} F(x)\,H_n(x)\,dx ,

where

H_n(x) = x^n - \frac{n(n-1)}{2}\,x^{n-2} + \frac{n(n-1)(n-2)(n-3)}{2\cdot 4}\,x^{n-4} - \cdots

is a so-called Hermite polynomial. To determine a_n numerically, we replace F(x) in (14) by the corresponding observed frequency function f(x), and replace x by x/σ if we measure x with ordinary units (feet, pounds, etc.) instead of using the standard deviation as the unit. Then we may write

(15)   a_n = \frac{(-1)^n}{n!}\int_{-\infty}^{\infty} f(x)\,H_n\!\left(\frac{x}{\sigma}\right)dx .

Insert the values of H_n(x/σ) for n = 3, 4, 5 in (15), and we obtain coefficients in terms of moments as follows, using the symbol α_i for the quotient μ_i/σ^i:

a_3 = -\frac{\mu_3}{\sigma^3\,3!} = -\frac{\alpha_3}{3!} , \qquad a_4 = \frac{\mu_4 - 3\sigma^4}{\sigma^4\,4!} = \frac{\alpha_4 - 3}{4!} , \qquad a_5 = -\frac{\mu_5 - 10\mu_3\sigma^2}{\sigma^5\,5!} = -\frac{\alpha_5 - 10\alpha_3}{5!} .

20. Remarks on two methods of determining the coefficients of the Type A series. It will be shown in § 57 that formula (14) for any coefficient a_n of the Type A series may be derived by making use of the fact that φ⁽ⁿ⁾(x) and the Hermite polynomials H_n(x) form a biorthogonal system. Then as indicated on page 168 we obtain a_n in terms of moments of the observed distribution. As a second method of obtaining a_n in terms of the moments of the observed distribution f(x), it will be shown in § 58 that the values of the coefficients given in § 19 may be derived by imposing the least-squares criterion that

(16)   F = \int_{-\infty}^{\infty} \frac{1}{\phi(x)}\,[f(x) - F(x)]^2\,dx

shall be a minimum.

21. The coefficients of the Type B series. For the Type B series (12), we shall for simplicity limit the determination of coefficients to the first three terms.
Moreover, we shall restrict our treatment to a distribution of equally distant ordinates at non-negative integral values of x. Then the problem is to find the coefficients c_0, c_1, c_2 in

F(x) = c_0\psi(x) + c_1\Delta\psi(x) + c_2\Delta^2\psi(x),

where

\psi(x) = \frac{e^{-\lambda}\lambda^x}{x!} \qquad \text{for } x = 0, 1, 2, \ldots.

By expressing the coefficients in terms of moments of the observed distribution as shown in § 59, we find

c_0 = 1, \qquad c_1 = 0, \qquad c_2 = \tfrac{1}{2}(\mu_2 - \lambda),

when λ is taken equal to the arithmetic mean of the given observed values.

22. Remarks. With respect to the selection of Type A or Type B of Charlier to represent given numerical data, no criterion corresponding to the Pearson criteria has been given which enables one to distinguish between cases in which to apply one of these types in preference to the other, but Type B applies, in general, to certain decidedly skew distributions; and, in particular, to distributions of variates having a natural lower or upper bound with the modal frequency much nearer to such natural bound than to the other end of the distribution. For example, a frequency distribution of the number dying per month in a city from a minor disease would have the modal value near zero, the natural lower bound.

While the systematic procedure in fitting Charlier curves to data is not so well standardized as the methods used in fitting curves of the Pearson system to data, tables of φ(t), where t is in units of standard deviation, of its integral from 0 to t, and of its second to eighth derivatives are given to five decimal places for the range t = 0 to t = 5 at intervals of .01 by James W. Glover, and tables of the function, its integral and first six derivatives are given by N. R. Jorgensen to seven decimal places for t = 0 to t = 4.

23. Skewness. Charlier has fittingly called the coefficients a_3, a_4, a_5, ...., along with the mean and standard deviation, the characteristics of the distribution.
The coefficients a_3 and a_4 may be interpreted so as to give characteristics which appear very significant in a description of a distribution to a general reader with little or no mathematical training. It is the common experience of those who have dealt with actual distributions of practical statistics that many of the distributions are not symmetrical. A measure is needed to indicate the degree of asymmetry or skewness of distributions in order that we may describe and compare the degrees of skewness of different distributions. A measure of skewness is given by

(17)  S = -3a_3 = \frac{\mu_3}{2\sigma^3} = \tfrac{1}{2}\alpha_3.

Another measure of skewness is

(18)  S = \frac{\text{Mean} - \text{Mode}}{\sigma}.

In this latter measure we have adopted a convention as to sign by which the skewness is positive when the mean is greater than the mode. Some authors define skewness as equal numerically but opposite in sign to the value in our definition.

We may easily prove that the measures (17) and (18) are equal for a distribution given by the Pearson Type III curve, and approximately equal for a distribution given by the first two terms of the Gram-Charlier Type A when S as defined in (17) is not very large.

For the Pearson Type III (p. 54),

\frac{dy}{dx} = -\frac{(x+a)\,y}{c_0 + c_1 x}.

When the parameters in this equation are expressed in moments about the mean, the equation takes the form

\frac{1}{y}\frac{dy}{dx} = -\frac{x + \mu_3/2\sigma^2}{\mu_2 + \mu_3 x/2\sigma^2},

if the origin is at the mean of the distribution. The mode is the value of x for which dy/dx = 0. That is,

\text{Mode} = -\frac{\mu_3}{2\sigma^2}, \qquad \frac{\text{Mean} - \text{Mode}}{\sigma} = \frac{\mu_3}{2\sigma^3}.

Hence the measures (17) and (18) are equal for the Type III distribution.

For a distribution given by the first two terms of Type A, we are to consider the frequency curve

(19)  y = \phi(x) + a_3\phi^{(3)}(x).

We shall now prove that the distance from the mean (origin) to the mode is approximately -\sigma S when S is fairly small.

We have from (19), if we neglect terms in S^2,

\frac{1}{y}\frac{dy}{dx} = -\frac{x}{\sigma^2} - \frac{S}{\sigma} + \frac{S x^2}{\sigma^3} = 0

at the mode.
Solving the quadratic for x we obtain x = -\sigma S if we neglect terms of the order S^3. Hence, the measures (17) and (18) are approximately equal for a distribution given by the first two terms of a Gram-Charlier Type A series.

24. Excess. In the general description of a given frequency distribution, we may add an important feature to the description by considering the relative number of variates in the immediate neighborhood of some central value such as the mean or the mode. That is, it would add to the description to give a measure of the degree of peakedness of a frequency curve fitted to a distribution by comparison with the corresponding normal curve fitted to the same distribution. The measure of the peakedness to which we shall now give attention is sometimes called the excess and sometimes the measure of kurtosis. The excess or degree of kurtosis is measured by

E = 3a_4 = \frac{1}{8}\left(\frac{\mu_4}{\sigma^4} - 3\right) = \frac{1}{8}(\alpha_4 - 3).

If the excess is positive, the number of variates in the neighborhood of the mean is greater than in a normal distribution. That is, the frequency curve is higher or more peaked in the neighborhood of the mean than the corresponding normal curve with the same standard deviation. On the other hand, if the excess is negative, the curve is more flat-topped than the corresponding normal curve.

To obtain a clearer insight into the relation of the measure of excess to the theoretical representation of frequency, let us consider a Gram-Charlier series of Type A to three terms

(20)  y = \phi(x) + a_3\phi^{(3)}(x) + a_4\phi^{(4)}(x)
        = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2}\left[1 + \frac{S}{3}(x^3 - 3x) + \frac{E}{3}(x^4 - 6x^2 + 3)\right].

When we compare the ordinate of (20) at the mean x = 0 with the ordinate 1/\sigma\sqrt{2\pi} at the mean for the normal curve, we observe that this ordinate exceeds the corresponding ordinate of the normal curve by E/\sigma\sqrt{2\pi}.
That is, the excess E is equal to the coefficient by which to multiply the ordinate at the centroid of the normal curve to get the increment to this ordinate as calculated by retaining the terms in \phi^{(3)}(x) and \phi^{(4)}(x) of the Type A series.

25. Remarks on the distribution of certain transformed variates. Underlying our discussion of frequency functions, there has perhaps been an implication that the various types of distribution could be accounted for by an appropriate theory of probability. There may, however, be other than chance factors that produce significant effects on the type of the distribution. Such effects may in certain cases be traced to their source by regarding the variates of a distribution as the results of transformations of the variates of some other type of distribution. Edgeworth was prominent in thus regarding certain distributions. For simple examples, we may think of the diameters, surfaces, and volumes of spheres that represent objects in nature, such as oranges on a tree or peas on a plant. Suppose the distribution of diameters is a normal distribution. It seems natural to inquire into the nature of the distribution of the corresponding surfaces and volumes. The partial answer to the inquiry is that these are distributions of positive skewness. The same kind of problem would arise if we knew that velocities, v, of molecules of gas were normally distributed, and were required to investigate the distribution of energies mv^2/2.

To illustrate somewhat more concretely with actual data, it may be observed in looking over the frequency distributions of the various subgroups on build of men, in Volume I of the Medico-Actuarial Mortality Investigation, that the distributions with respect to weight are, in general, not so nearly symmetrical as the distributions as to height. In fact, the distributions as to weight exhibit marked positive skewness.
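The effect just described is easy to exhibit by simulation. In the following sketch (a modern illustration added here; the seed, sample size, and parameter values are arbitrary assumptions), normally distributed "diameters" are cubed into sphere "volumes," and the volumes show marked positive skewness as measured by α₃ = μ₃/σ³:

```python
import math
import random

random.seed(7)
# normally distributed diameters (mean 10, standard deviation 1)
diameters = [random.gauss(10.0, 1.0) for _ in range(20000)]
# corresponding sphere volumes: the cube transformation k*x**m with m = 3
volumes = [(math.pi / 6) * d ** 3 for d in diameters]

def alpha3(data):
    """alpha_3 = mu_3 / sigma**3, twice the skewness measure S of (17)."""
    n = len(data)
    mean = sum(data) / n
    mu2 = sum((x - mean) ** 2 for x in data) / n
    mu3 = sum((x - mean) ** 3 for x in data) / n
    return mu3 / mu2 ** 1.5
```

The diameters come out nearly symmetrical, while the volumes are decidedly positively skew, in accord with the text.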
For example, in the age group 25 to 29 and height 5 feet 6 inches we find the following distribution:

W | 105 | 120 | 135 | 150 | 165 | 180 | 195 | 210
F | 17 | 722 | 2,175 | 1,346 | 485 | 155 | 33 | 3

where W = weight in pounds, F = frequency.

A similar feature had been observed by the writer in examining many frequency distributions of ears of corn with respect to length of ears and weight of ears. The distributions as to weight showed this tendency to positive skewness, whereas the distributions as to lengths of ears were much more nearly symmetrical. It seems natural to assume that the weights of bodies are closely correlated with volumes. We may next take account of the fact that volumes of similar solids vary as the cubes of like linear dimensions.

Such concrete illustrations suggest the investigation of the equation of the frequency curve of values obtained by the transformation of variates of a normal distribution by replacing each variate x of the normal distribution by an assigned function of the form kx^m, where k is a positive constant and m is a positive integer or the reciprocal of a positive integer. A paper on this subject by the writer appeared in the Annals of Mathematics in June, 1922. The skewness observed in the distributions of weights is similar to the skewness which results as the effect of this transformation when m is a positive integer.

From a different standpoint S. D. Wicksell in the Arkiv for Matematik, Astronomi, och Fysik in 1917 has discussed, by means of a generalized hypothesis about elementary errors, a connection between certain functions of a variate and a genetic theory of frequency. The hypotheses involved in this theory are at least plausible in their relation to certain statistical phenomena.
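The positive skewness of the weight distribution printed above can be verified directly from the table by the measure (17). A short computational sketch (an illustration added here; variable names are ours):

```python
# grouped weight distribution from the Medico-Actuarial table above
weights = [105, 120, 135, 150, 165, 180, 195, 210]
freqs = [17, 722, 2175, 1346, 485, 155, 33, 3]

N = sum(freqs)
mean = sum(w * f for w, f in zip(weights, freqs)) / N

def mu(j):
    # j-th moment of the grouped distribution about its mean
    return sum(f * (w - mean) ** j for w, f in zip(weights, freqs)) / N

# skewness measure of equation (17): S = mu_3 / (2*sigma**3)
S = mu(3) / (2 * mu(2) ** 1.5)
```

S comes out positive, in accord with the marked positive skewness noted in the text.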
There are thus at least two points of view which indicate that the method which uses variates resulting from transformation may rise above the position of a device for fitting distributions and be given a place in the theory of frequency. A recent paper by E. L. Dodd presents a somewhat critical study of the determination of the frequency law of a function of variables with given frequency laws, and another recent paper by S. Bernstein deals with appropriate transformations of variates of certain skew distributions.

26. Remarks on the use of various frequency functions as generating functions in a series representation. In the Handbook of Mathematical Statistics (1924), page 116, H. C. Carver called attention to certain generating functions designed to make frequency series more rapidly convergent than the Type A series. In a paper published in 1924 on the "Generalization of Some Types of the Frequency Curves of Professor Pearson" (Biometrika, pp. 106-16), Romanovsky has used Pearson's frequency functions of Types I, II, and III as the generating functions of infinite series in which these types are involved in a manner analogous to the way in which the normal probability function is involved in the Gram-Charlier series. When Type I,

y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2},

is used as a generating function, certain functions \phi_k, which are polynomials of Jacobi in slightly modified form, occur in the expansion in a way analogous to that in which the Hermite polynomials occur in the Gram-Charlier expansion. Moreover, the analogy is continued because y_0\phi_k and \phi_k form a biorthogonal system, and this property facilitates the determinations of the coefficients in the series.
When the Type III function is used as a generating function, certain functions \phi_k, which are polynomials of Laguerre in generalized form, play a rôle similar to that of the polynomials of Hermite in the Gram-Charlier expansion.

While it is at least of theoretical interest that various frequency functions may assume rôles in the series representation of frequency somewhat similar to the rôle of the normal frequency function in the Gram-Charlier theory, the fact should not be overlooked that the usefulness of any series representation in applications to numerical data is much restricted by the requirement of such rapid convergence of the series that only a few terms need be taken to obtain a useful approximation.

CHAPTER IV

CORRELATION

27. The meaning of simple correlation. Suppose we have data consisting of N pairs of corresponding variates (x_i, y_i), i = 1, 2, ...., N. The given pairs of values may arise from any one of a great variety of situations. For example, we may have a group of men in which x represents the height of a man and y his weight; we may have a group of fathers and their oldest sons in which x is the stature of a father and y that of his oldest son; we may have minimal daily temperatures in which x is the minimal daily temperature at New York and y the corresponding value for Chicago; we may be considering the effect of nitrogen on wheat yield where x is pounds of nitrogen applied per acre and y the wheat yield; we may be throwing two dice where x is the number thrown with the first die and y the number thrown with the two dice together. If such a set of pairs of variates is represented by dots marking the points whose rectangular co-ordinates are (x, y), we obtain a so-called "scatter-diagram" (Fig. 17).

Assume next that we are interested in a quantitative characterization of the association of the x's and the corresponding y's.
One of the most important questions which can be considered in such a characterization is that of the connection or correlation, as it is called, between the two sets of values. It is fairly obvious from the scatter-diagram that, with values of x in an assigned interval dx (dx small), the corresponding values of y may differ considerably and thus the y corresponding to an assigned x cannot be given by the use of a single-valued function of x. On the other hand, it may be easily shown that in certain cases, for an assigned x larger than the mean value of x's, a corresponding y taken at random is much more likely to be above than below the mean value of y's. In other words, the x's and y's are not independent in the probability sense of independence. There is often in such situations a tendency for the dots of the scatter-diagram to fall into a sort of band which can be fairly well described. In short, there exists an important field of statistical dependence and connection between the regions of perfect dependence given by a single-valued mathematical function at one extreme and perfect independence in the probability sense at the other extreme. This is the field of correlated variables, and the problems in this field are so varied in their character that the theory of correlation may properly be regarded as an extensive branch of modern methodology.

28. The regression method and the correlation surface method of describing correlation. It may help to visualize the theory of correlation if we point out two fundamental ways of approach to the characterization of a distribution of correlated variables, although the two methods have much in common. The one may be called the "regression method," and the other the "correlation surface method." Let us assume that the pairs of variates (x, y) are represented by dots of a scatter-diagram, and set the problem of characterizing the correlation.
First, separate the dots into classes by selecting class intervals dx. When we restrict the x's to values in such an interval dx, the set of corresponding y's is called an x-array of y's or simply an array of y's. Similarly, when we restrict the assignment of y's to a class interval dy, the corresponding set of x's is called a y-array of x's or simply an array of x's. The whole set of arrays of a variable, say of y, is often called a set of parallel arrays.

The regression curve y = f(x) of y on x for a population is defined to be the locus of the expected value (§ 6) of the variable y in the array which corresponds to an assigned value of x, as dx approaches zero. In other words, the regression curve of y on x is the locus of the means of arrays of y's of the theoretical distribution, as dx approaches zero. These equivalent definitions relate to the ideal population from which a sample is to be drawn. The regression curve found from a sample is merely a numerical approximation to the ideal set up in the definition.

In the regression method, our first interest is in the regression curves of y on x and of x on y. We are interested next in the characterization of the distribution of the values of y (array of y's) whose expected or average value we have predicted. This is accomplished to some extent by means of measures of dispersion of the values of y which correspond to an assigned value of x. To illustrate the regression method by reference to the correlation between statures of father and son, we may say that the first concern in the use of the regression method is with predicting the mean stature of a subgroup of men whose fathers are of any assigned height, and the next concern is with predicting the dispersion of such a subgroup. The complete characterization of the theoretical distributions underlying arrays of y's may be regarded as the complete solution of the problem of the statistical dependence of y on x.
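The definition of the regression curve as the locus of the means of arrays suggests the following minimal sketch (an illustration added here, with hypothetical data), which groups pairs into x-arrays and returns the mean of the y's in each array:

```python
def regression_of_y_on_x(pairs):
    """Empirical regression of y on x: the mean of the y's in each
    x-array, i.e. among the pairs sharing a common value of x."""
    arrays = {}
    for x, y in pairs:
        arrays.setdefault(x, []).append(y)
    # map each x to the mean of its array of y's
    return {x: sum(ys) / len(ys) for x, ys in sorted(arrays.items())}
```

For grouped data the x's would first be classed into intervals dx; here each distinct x is its own class.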
In the correlation surface method for the two variables, our primary interest is in the characterization of the probability \phi(x, y)\,dx\,dy that a pair of corresponding variates (x, y) taken at random will fall into the assigned rectangular area bounded by x to x+dx and y to y+dy. This method may be regarded as an extension to functions of two or more variables of the method of theoretical frequency functions of one variable. To get at the meaning of correlation by this method, suppose that a function g(x) is such that g(x)dx gives, to within infinitesimals of higher order, the probability that a variate x taken at random lies between x and x+dx; and suppose that h(x, y)dy gives similarly the probability that a variate y taken at random from the array of values which correspond to values of x in the interval x to x+dx will lie between y and y+dy. Then the probability that the two events will both happen is given by the product

(1)  \phi(x, y)\,dx\,dy = g(x)\,h(x, y)\,dx\,dy.

For the probability that both of two events will happen is the product of the probability that the first will happen, multiplied by the probability that the second will happen when the first is known to have happened.

Two cases occur in considering this product. In the first case, h(x, y) is a function of y alone. When this is the case we say the x and y variates are uncorrelated and \phi(x, y) is simply the product of a function of x only multiplied by a function of y only. In such a case the probability that a variate y will be between y and y+dy is the same whether the corresponding assigned x be large or small. In the second case h(x, y) is a function of both x and y. In such cases, the probability that a variate y will be between y and y+dy is not, in general, the same for corresponding assigned large and small values of x. In such cases the two systems of variates are said to be correlated.
Thus, in considering for example a group of college students, the height of a student is probably uncorrelated with the grades he makes in mathematics or with the income of his father, but his height is correlated with his weight, and with the height of his father.

Both the regression method and the correlation surface method of dealing with correlation have been in evidence almost from the earliest contributions to the subject. The early method of Francis Galton was essentially the regression method, but the mathematical solution of the special problem which he proposed to J. D. Hamilton Dickson in 1886 consisted in giving the equation of the normal frequency surface to correspond to given lines of regression. The solution of this problem thus involved the correlation surface method. Furthermore, the early contributions of Karl Pearson to correlation theory, involving the influence of selection, stressed frequency surfaces more than regression equations. But, beginning with a paper by G. Udny Yule in 1897, the theory has been developed without limitation to a particular type of frequency surface. It is a fact of some interest that Yule returned very closely to the primary ideas of Galton, by placing the emphasis on the lines of regression. Moreover, the success of the regression method of approach should give us an insight into the simplicity and fundamental character of Galton's original ideas.

29. The correlation coefficient. The degree of correlation is often measured by the Pearsonian coefficient of correlation represented by the letter r. Consider N pairs of variates (x_i, y_i), i = 1, 2, ...., N, such as are described above, and let (\bar{x}, \bar{y}) represent the corresponding arithmetic means of x's and y's. Then

\sigma_x = \left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^{1/2}, \qquad
\sigma_y = \left[\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2\right]^{1/2}

are the standard deviations of the two series.
Assuming that at least two of the x's are unequal so that \sigma_x \neq 0, we let any variate which is denoted by x_i in original units (yards, miles, pounds, dollars) be denoted by x'_i when measured from the mean \bar{x} with the standard deviation \sigma_x as a unit. Similarly, let the value y_i be denoted by y'_i when measured from the mean \bar{y} with \sigma_y as a unit. That is,

x'_i = (x_i - \bar{x})/\sigma_x, \qquad y'_i = (y_i - \bar{y})/\sigma_y.

Then in terms of x'_i and y'_i, the correlation coefficient is given by the simple formula

(2)  r = \frac{1}{N}\sum_{i=1}^{N} x'_i\,y'_i.

That is, the correlation coefficient of two sets of variates, expressed with their respective standard deviations as units, may be defined as the arithmetic mean of the products of deviations of corresponding values from their respective arithmetic means.

We have defined the correlation coefficient r for a sample. The expected value of the right-hand member of (2) in the sampled population is then the correlation coefficient for the population.

While the formula (2) is very useful for the purpose of giving the meaning of the correlation coefficient, other formulas easily obtained from (2) are usually much better adapted to numerical computation. For example,

(3)  r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N\sigma_x\sigma_y},

(4)  r = \frac{\sum x_i y_i - N\bar{x}\bar{y}}{\left[\left(\sum x_i^2 - N\bar{x}^2\right)\left(\sum y_i^2 - N\bar{y}^2\right)\right]^{1/2}}

are ordinarily more convenient than (2) for purposes of computation. When N is small, say < 30, formula (4) is readily applied. When N is large, appropriate forms for the calculation of r are available in various books.

Still other forms for expressing r are useful for certain purposes. For example, for the purpose of showing that -1 \le r \le 1, we shall now give two further formulas for r. By simple algebraic verification and remembering that

1 = \frac{1}{N}\sum x_i'^2 = \frac{1}{N}\sum y_i'^2,

it follows that (2) may be written in the forms

(5)  r = 1 - \frac{1}{2N}\sum (x'_i - y'_i)^2,

(6)  r = -1 + \frac{1}{2N}\sum (x'_i + y'_i)^2.

From these two formulas, we have the important proposition that

(7)  -1 \le r \le 1.

THE REGRESSION METHOD OF DESCRIPTION

30. Linear regression. Suppose we are interested in the mean value \bar{y}_x of the y's in the x-array of y's.
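The agreement between the defining form (2) and a computational form of r may be checked numerically. The sketch below is an added illustration (the data and function names are ours):

```python
import math

def corr(xs, ys):
    """Correlation coefficient by the computational form:
    r = (sum x*y - N*xbar*ybar) /
        sqrt((sum x^2 - N*xbar^2) * (sum y^2 - N*ybar^2))."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) - n * xb * yb
    den = math.sqrt((sum(x * x for x in xs) - n * xb * xb)
                    * (sum(y * y for y in ys) - n * yb * yb))
    return num / den

def corr_by_definition(xs, ys):
    """Definition (2): mean of products of standardized deviations."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xb) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - yb) ** 2 for y in ys) / n)
    return sum((x - xb) / sx * (y - yb) / sy for x, y in zip(xs, ys)) / n
```

For any data the two forms agree, and the result lies between -1 and 1, as the proposition (7) requires.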
The simplest and most important case to consider from the standpoint of the practical problems of statistics is that in which the regression of y on x is a straight line. Assuming that the regression curve of y on x in the population is a straight line, we accept as an approximation the line \bar{y}_x = mx + b which fits "best" the means of arrays of the sample. The term "best" is here used to mean best under a least-squares criterion of approximation. In applying the criterion the square (\bar{y}_x - mx - b)^2 for each array is weighted with the number in the array. Let N_x be the number of dots in any assigned x-array of y's. Then the equation of our line of regression would be

(8)  \bar{y}_x = mx + b,

where m and b are to be determined by the condition that the sum

(9)  \sum N_x(\bar{y}_x - mx - b)^2,

with observed data substituted for x, \bar{y}_x and N_x from all arrays, is to be a minimum. Differentiating (9) with respect to b and m, we have

(10)  -2\sum N_x(\bar{y}_x - mx - b) = 0,

(11)  -2\sum N_x(\bar{y}_x - mx - b)\,x = 0.

We may note that N_x\bar{y}_x is equal to the sum of all y's in an array of y's. If we examine these equations on making substitutions for \bar{y}_x and x, it is easily seen that they are, except for grouping errors which vanish as dx \to 0, equivalent to the equations

(12)  -2\sum (y_i - mx_i - b) = 0,

(13)  -2\sum x_i(y_i - mx_i - b) = 0,

where the summation is extended to all the given pairs. That is, we may find the regression line by obtaining the linear function y = mx + b which gives the best least-square estimate of the values of y which correspond to assigned values of x.

Take the origin at the mean of x's and the mean of y's. Then \sum y_i = 0, \sum x_i = 0. Hence, from (12), b = 0. From (13),

m = \frac{\sum x_i y_i}{\sum x_i^2} = r\,\frac{\sigma_y}{\sigma_x},

and the equation of the line of regression of y on x is

(14)  y = r\,\frac{\sigma_y}{\sigma_x}\,x.

Similarly, the line of regression of x on y is

(15)  x = r\,\frac{\sigma_x}{\sigma_y}\,y.

It should be remembered that the origin is at the mean values of x's and of y's when the regression equations take the forms (14) and (15).
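That the least-squares slope obtained from (13) agrees with the regression coefficient of (14) may be verified numerically. A minimal sketch (added here; the data are arbitrary):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
xb, yb = sum(xs) / n, sum(ys) / n
dx = [x - xb for x in xs]  # deviations: origin at the means
dy = [y - yb for y in ys]

# slope from the normal equation (13): m = sum(x*y) / sum(x**2)
m = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

sx = math.sqrt(sum(a * a for a in dx) / n)
sy = math.sqrt(sum(b * b for b in dy) / n)
r = sum(a * b for a, b in zip(dx, dy)) / (n * sx * sy)
# equation (14) asserts that the slope equals r * sigma_y / sigma_x
```

The two expressions for the slope are algebraically identical, so they agree to rounding error for any data.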
It is obvious that these equations may be written as

(16)  y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})

and

(17)  x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})

when we take any arbitrary origin.

The coefficient r\sigma_y/\sigma_x is called the regression coefficient of y on x, and similarly r\sigma_x/\sigma_y is the regression coefficient of x on y. If we use standard deviations as units of measurement the regression equations (14) and (15) become

(18)  y' = rx', \qquad x' = ry',

and the regression coefficients are equal to each other and to the correlation coefficient.

When there is no correlation between x's and y's, r = 0, and the regression lines of y on x and of x on y are parallel to the x- and y-axes, respectively. On the other hand, when r = 0, it is not necessarily true that there is no correlation. Indeed, there may be a high correlation with non-linear regression when r = 0. For example, we may have r = 0 when y is a simple periodic function of x.

31. The standard deviation of arrays — mean square error of estimate. In passing judgment on the degree of precision to be expected in estimating the value of a variable, say y, by means of the regression equation of y on x, it is important to have a measure of the dispersion in arrays of y's.

The mean square error s_y^2 involved in taking the ordinates of the line of regression as the estimated values of y may be very simply expressed by s_y^2 = \sigma_y^2(1 - r^2). To prove that s_y^2 takes this value, we may write the sum of the squares of deviations in the form

N s_y^2 = \sum\left(y_i - r\,\frac{\sigma_y}{\sigma_x}\,x_i\right)^2
        = \sum y_i^2 - 2r\,\frac{\sigma_y}{\sigma_x}\sum x_i y_i + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\sum x_i^2
        = N\sigma_y^2(1 - r^2).

Hence, we have

(19)  s_y^2 = \sigma_y^2(1 - r^2),

(20)  s_y = \sigma_y(1 - r^2)^{1/2}.

This value of s_y may be regarded as a sort of average value of the standard deviations of the arrays of y's, and is sometimes called the root-mean-square error of estimate of y or, more briefly, the standard error of estimate of y. The factor (1 - r^2)^{1/2} in (20) has been called the coefficient of alienation or the measure of the failure to improve the estimate of y from knowledge of the correlation.
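Formula (19) may be confirmed by computing the residuals about the fitted line directly. A sketch (an added illustration with arbitrary data; the function name is ours):

```python
import math

def std_error_of_estimate(xs, ys):
    """Root-mean-square error s_y of the estimates of y from the line
    of regression of y on x, computed directly from the residuals."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    dx = [x - xb for x in xs]
    dy = [y - yb for y in ys]
    # least-squares slope with origin at the means
    m = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
    return math.sqrt(sum((b - m * a) ** 2 for a, b in zip(dx, dy)) / n)
```

The value so computed equals sigma_y*(1 - r**2)**0.5 of (20); in particular r = .5 leaves (1 - .25)**0.5 = .866 of the original dispersion.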
When the standard deviation of an array of y's is regarded as a function, say S(x), of the assigned x, the curve y = S(x)/\sigma_y is called the scedastic curve. It may be described as the curve whose ordinates measure the scatter in arrays of y's in comparison to the scatter of all y's. When S(x) is a constant, the regression system of y on x is called a homoscedastic system. When S(x) is not constant, the system is said to be heteroscedastic. For a homoscedastic system with linear regression, s_y = \sigma_y(1 - r^2)^{1/2} is the standard deviation of each array of y's.

To illustrate (20) numerically, let us suppose that r = .5 gives the correlation of statures of fathers and sons. Assuming linear regression, the root-mean-square error of estimate of the height of a son derived from the assigned height of the father would be

s_y = \sigma_y[.75]^{1/2} = .866\,\sigma_y.

That is, the average dispersion in the arrays of heights of sons which correspond to assigned heights of fathers is about .87 as great as the dispersion of the heights of all the sons. It is, therefore, fairly obvious that we cannot, with any considerable degree of reliability, predict from r = .5 the height of an individual son from the height of the father. However, with a large N, we can give a very reliable prediction of the mean heights of sons that correspond to assigned heights of fathers.

It should be remembered that we have thus far assumed linear regression of y on x. An analogous consideration of the dispersion in arrays of x's gives

s_x^2 = \sigma_x^2(1 - r^2)

for the mean square error of estimate when we assume linear regression of x on y.

32. Non-linear regression — the correlation ratio. In case a curve of regression, say of y on x, is not a straight line, the correlation coefficient as a measure of correlation may be misleading.
In introducing a correlation ratio, \eta_{yx}, of y on x as an appropriate measure of correlation to take the place of the correlation coefficient in such a situation, we may get suggestions as to what is appropriate by solving for r^2 in (19). This gives

(21)  r^2 = 1 - s_y^2/\sigma_y^2,

where we may recall that s_y^2 is the mean square of deviations from the line of regression. Then

r = \pm(1 - s_y^2/\sigma_y^2)^{1/2}.

This formula could be used appropriately as a definition of r in place of our definition in (2), and its examination may throw further light on the significance of r. When s_y = 0, the formula gives r = \pm 1 and, as we have seen earlier, all the dots of the scatter-diagram must then fall exactly on the line of regression y = r\sigma_y x/\sigma_x. When s_y = \sigma_y, the formula gives r = 0, and the regression line is in this case of no aid in predicting the value of y from assigned values of x.

In the formula r^2 = 1 - s_y^2/\sigma_y^2 it is important to keep in mind that the mean square deviation s_y^2 is from the line of regression (§ 31). Next, let \bar{s}_y^2 be the corresponding mean square of deviations from the means of arrays. Then in the population \bar{s}_y^2 = s_y^2 when the regression is strictly linear, but \bar{s}_y^2 \neq s_y^2 when the regression is non-linear. This fact suggests the use of a formula closely related to [1 - s_y^2/\sigma_y^2]^{1/2} for a measure of non-linear regression by replacing s_y by \bar{s}_y. We then write

(22)  \eta_{yx}^2 = 1 - \bar{s}_y^2/\sigma_y^2,

where \eta_{yx} is the correlation ratio of y on x, and \bar{s}_y^2 is the mean square of deviations from the means of arrays whether these means are near to or far from the line of regression. For linear regression of y on x, we have \eta_{yx}^2 = r^2 in the population.

In general, we may say that the correlation ratio of y on x is a measure of the clustering of dots of the scatter-diagram about the means of arrays of y's. An analogous discussion for the arrays of x's obviously leads to giving \eta_{xy}, the correlation ratio of x on y.
That \eta_{yx}^2 \le 1, and that the equality holds only when all the dots in each array are at the mean of the array, follows at once from (22). That \eta_{yx}^2 \ge r^2 may be shown by recalling the meanings of s_y^2 in (21) and of \bar{s}_y^2 in (22). A mean square of deviations in each array is a minimum when the deviations are taken from the mean of the array. Hence, the \bar{s}_y^2 in (22) must be equal to or less than s_y^2 in (21) for the same data, since the deviations in (21) are measured from the line of regression. Hence, we have shown that

r^2 \le \eta_{yx}^2 \le 1.

Moreover, when the regression of y on x is linear, \eta_{yx}^2 - r^2 found from the sample differs from zero by an amount not greater than the fluctuations due to random sampling. Indeed, the comparison of the quantity \eta_{yx}^2 - r^2 with its sampling errors becomes the most useful known criterion for testing the linearity of the regression of y on x.

For some purposes, it is convenient to express the correlation ratios in a form involving the standard deviation of the means of arrays. For this purpose, let \bar{y}_x be the mean of any array of y's, and \sigma_{\bar{y}_x} the standard deviation of the means of arrays when the square (\bar{y}_x - \bar{y})^2 of each deviation is weighted with the number N_x in the array. Then it follows very simply that

\eta_{yx}^2 = \frac{\sigma_{\bar{y}_x}^2}{\sigma_y^2}.

That is, the correlation ratio of y on x is the ratio of the standard deviation of the means of arrays of y's to the standard deviation of all y's.

The calculation of the correlation ratio with a large number N of pairs may be carried out very conveniently as a mere extension of the calculation of the correlation coefficient. For a form for such calculation, see Handbook of Mathematical Statistics, page 130. In order to get a fair approximation to a correlation ratio in a population from a sample, it is important that the grouping into class intervals be not so narrow as to give arrays containing very few variates. Certain valuable formulas for the correction of errors due to grouping have been published.
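The expression of the correlation ratio as the ratio of the variance of array means to the variance of all y's lends itself to direct computation from grouped data. A sketch (an added illustration; the function and data names are ours):

```python
def eta_squared(pairs):
    """Correlation ratio eta^2 of y on x: the variance of the array
    means (each weighted by the array frequency N_x) divided by the
    variance of all y's.  An array is the set of y's sharing an x."""
    arrays = {}
    for x, y in pairs:
        arrays.setdefault(x, []).append(y)
    ys = [y for _, y in pairs]
    n = len(ys)
    ybar = sum(ys) / n
    var_y = sum((y - ybar) ** 2 for y in ys) / n
    var_means = sum(len(a) * (sum(a) / len(a) - ybar) ** 2
                    for a in arrays.values()) / n
    return var_means / var_y
```

For the parabolic relation y = x^2 the correlation coefficient vanishes, yet the dots lie exactly on the means of arrays, so eta^2 = 1; when every array mean coincides with the general mean, eta^2 = 0.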
When the regression is non-linear, the correlation may be further characterized by the equation of a curve of regression that passes approximately through the means of arrays of a given system of variates. As early as 1905, the parameters of the special regression curves given by polynomials $y = f(x)$ of the second and third degrees were determined in terms of power moments and product moments. In 1921, Karl Pearson published a general method of determining successive terms of the regression curve of the form
$$(24)\qquad y = f(x) = a_0\psi_0 + a_1\psi_1 + \cdots + a_n\psi_n ,$$
where $a_0, a_1, \ldots, a_n$ are constants to be determined and the functions $\psi$ form an orthogonal system of functions of $x$; that is,
$$\sum N_x\, \psi_p \psi_q = 0 , \qquad p \ne q ,$$
if the summation be taken for all values of $x$ corresponding to a system of arrays, with the frequency in an $x$-array given by $N_x$. An exposition of the theory of non-linear regression curves is somewhat beyond the scope of this monograph.

33. Multiple correlation. Thus far we have considered only simple correlation, that is, correlation between two variables. But situations frequently arise which call for the investigation of correlation among three or more variables. A familiar example occurs in the correlation of a character such as stature in man with the statures of each of the two parents, of each of the four grandparents, and possibly with the statures of others back in the ancestral line. Other examples can readily be cited. Indeed, it is very generally true that several variables enter into many problems of biology, economics, psychology, and education. The solution of these problems calls for a development of correlation among three or more variables. Suppose we have given $N$ sets of corresponding values of $n$ variables $x_1, x_2, \ldots, x_n$. Assume next that we separate the values of $x_1$ into classes by selecting class intervals $dx_2, dx_3, \ldots, dx_n$ of the remaining variables.
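The practical advantage of an orthogonal system in (24) may be sketched numerically (a modern illustration, not Pearson's own construction; the array positions, counts, and means below are made up): once the $\psi$'s satisfy $\sum N_x \psi_p \psi_q = 0$ for $p \ne q$, each coefficient $a_j$ is determined independently of the others by a weighted projection.

```python
# Sketch: fitting a regression curve of the form (24) with an orthogonal
# system of functions of x, built here by Gram-Schmidt from 1, x, x^2.
# xs: array positions, Nx: array frequencies, ybar: array means (made up).

def orthogonal_fit(xs, Nx, ybar, degree=2):
    # inner product weighted by the array frequencies N_x
    def dot(u, v):
        return sum(n * a * b for n, a, b in zip(Nx, u, v))

    raw = [[x ** k for x in xs] for k in range(degree + 1)]
    psis, coeffs = [], []
    for p in raw:
        # orthogonalize the next power against the psi's already found
        for q in psis:
            c = dot(p, q) / dot(q, q)
            p = [a - c * b for a, b in zip(p, q)]
        psis.append(p)
        # each coefficient of (24) is now found independently
        coeffs.append(dot(ybar, p) / dot(p, p))
    # fitted value of the regression curve at each array position
    fitted = [sum(a * psi[i] for a, psi in zip(coeffs, psis))
              for i in range(len(xs))]
    return coeffs, fitted

coeffs, fitted = orthogonal_fit([0, 1, 2, 3], [5, 8, 8, 5], [1.0, 2.0, 5.0, 10.0])
```

Since the array means here lie exactly on $y = x^2 + 1$, the degree-2 fit reproduces them exactly; adding a further term $a_3\psi_3$ would leave $a_0, a_1, a_2$ unchanged, which is the point of orthogonality.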
When we limit the $x_2$'s to an assigned interval $dx_2$, the $x_3$'s to an assigned interval $dx_3$, and so on, the set of corresponding $x_1$'s is sometimes called an array of $x_1$'s. The locus of the means of such arrays of $x_1$'s in the theoretical distribution, as $dx_2, dx_3, \ldots, dx_n$ approach zero, is called the regression surface of $x_1$ on the remaining variables. It will be convenient to assume that any variable $x_j$ is measured from the arithmetic mean of its $N$ given values as an origin. Let $\sigma_j$ be the standard deviation of the $N$ values of $x_j$, and let $r_{pq}$ be the correlation coefficient of the $N$ given pairs of values of $x_p$ and $x_q$. Then we seek to determine $b_{12}, b_{13}, \ldots, b_{1n}, c$, the parameters in the linear regression surface
$$(25)\qquad x_1 = b_{12}x_2 + b_{13}x_3 + \cdots + b_{1n}x_n + c$$
of $x_1$ on the remaining variables, so that $x_1$ computed from (25) will give on the whole the "best" estimates of the values of $x_1$ that correspond to any assigned values of $x_2, x_3, \ldots, x_n$. Adopting a least-squares criterion, we may determine the coefficients in (25) so that
$$(26)\qquad U = \sum \left(x_1 - b_{12}x_2 - b_{13}x_3 - \cdots - b_{1n}x_n - c\right)^2$$
shall be a minimum. This gives for the linear regression surface of $x_1$ on $x_2, x_3, \ldots, x_n$
$$(27)\qquad x_1 = -\,\sigma_1 \sum_{q=2}^{n} \frac{R_{1q}}{R_{11}}\,\frac{x_q}{\sigma_q} ,$$
where $R_{pq}$ is the cofactor of the $p$th row and the $q$th column of the determinant
$$(28)\qquad R = \begin{vmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1n} \\ r_{21} & 1 & r_{23} & \cdots & r_{2n} \\ r_{31} & r_{32} & 1 & \cdots & r_{3n} \\ \vdots & & & & \vdots \\ r_{n1} & r_{n2} & \cdots & & 1 \end{vmatrix} .$$
For simplicity we shall limit ourselves to $n = 3$ in giving proofs of these statements, but the method can be extended in a fairly obvious manner from three variables to any number of variables. Equating to zero the first derivatives of $U$ in (26) with respect to $c$, $b_{12}$, and $b_{13}$, we obtain, when $n = 3$, the equations
$$c = 0 , \qquad \sum x_2\left(x_1 - b_{12}x_2 - b_{13}x_3\right) = 0 , \qquad \sum x_3\left(x_1 - b_{12}x_2 - b_{13}x_3\right) = 0 .$$
The last two equations may be written in the form
$$\sum x_1x_2 - b_{12}\sum x_2^2 - b_{13}\sum x_2x_3 = 0 , \qquad \sum x_1x_3 - b_{12}\sum x_2x_3 - b_{13}\sum x_3^2 = 0 .$$
By expressing the summations in terms of standard deviations and correlation coefficients, we have
$$(29)\qquad N b_{12}\sigma_2^2 + N b_{13} r_{23}\sigma_2\sigma_3 = N r_{12}\sigma_1\sigma_2 ,$$
$$(30)\qquad N b_{12} r_{23}\sigma_2\sigma_3 + N b_{13}\sigma_3^2 = N r_{13}\sigma_1\sigma_3 .$$
Solving for $b_{12}$ and $b_{13}$, we obtain
$$b_{12} = \frac{\sigma_1}{\sigma_2}\,\frac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}} = -\,\frac{\sigma_1}{\sigma_2}\,\frac{R_{12}}{R_{11}} , \qquad
b_{13} = \frac{\sigma_1}{\sigma_3}\,\frac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}} = -\,\frac{\sigma_1}{\sigma_3}\,\frac{R_{13}}{R_{11}} ,$$
where $R_{pq}$ is the cofactor of the $p$th row and $q$th column of
$$R = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix} .$$
If the dispersion (scatter) $\sigma_{1.23\ldots n}$ of the observed values of $x_1$ from its corresponding computed values on the hyperplane (27) is defined as the square root of the mean square of the deviations, that is,
$$(31)\qquad \sigma_{1.23\ldots n}^2 = \frac{1}{N}\sum \left(\text{observed } x_1 - \text{computed } x_1\right)^2 ,$$
then it can be proved that
$$(32)\qquad \sigma_{1.23\ldots n} = \sigma_1\left(R/R_{11}\right)^{1/2} .$$
To prove this for $n = 3$, we may write from (27) and (31)
$$\sigma_{1.23}^2 = \frac{\sigma_1^2}{R_{11}^2}\left(R_{11}^2 + R_{12}^2 + R_{13}^2 + 2R_{11}R_{12}r_{12} + 2R_{11}R_{13}r_{13} + 2R_{12}R_{13}r_{23}\right)$$
$$= \frac{\sigma_1^2}{R_{11}^2}\left[R_{11}\left(R_{11} + r_{12}R_{12} + r_{13}R_{13}\right) + R_{12}\left(R_{12} + r_{12}R_{11} + r_{23}R_{13}\right) + R_{13}\left(R_{13} + r_{13}R_{11} + r_{23}R_{12}\right)\right] .$$
Since, from elementary theorems on determinants,
$$R_{11} + r_{12}R_{12} + r_{13}R_{13} = R , \qquad R_{12} + r_{12}R_{11} + r_{23}R_{13} = 0 , \qquad R_{13} + r_{13}R_{11} + r_{23}R_{12} = 0 ,$$
we have
$$(33)\qquad \sigma_{1.23}^2 = \sigma_1^2\, R/R_{11} , \qquad \sigma_{1.23} = \sigma_1\left(R/R_{11}\right)^{1/2} .$$
As an extension of the standard error of estimate with two variables (p. 87), it is true for $n$ variables that the standard error $\sigma_{1.23\ldots n}$ of estimating $x_1$ from assigned values of $x_2, x_3, \ldots, x_n$ is the standard deviation of each array of $x_1$'s, provided all regressions are linear and the standard deviation of an array of $x_1$'s is the same for all sets of assignments of $x_2, x_3, \ldots, x_n$. Next, we shall inquire into the dispersion of the estimated values given by (27).
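A numerical sketch (a modern check, with made-up correlations and standard deviations) confirming that the cofactor expressions for $b_{12}$ and $b_{13}$ satisfy the normal equations (29) and (30), and evaluating $\sigma_{1.23}$ from (33):

```python
# Check, for n = 3, that b12 = -(s1/s2) R12/R11 and b13 = -(s1/s3) R13/R11
# solve the normal equations (29)-(30), and evaluate sigma_{1.23} by (33).
# The correlations and standard deviations below are made up.

r12, r13, r23 = 0.6, 0.5, 0.4
s1, s2, s3 = 2.0, 3.0, 1.5

# determinant R of (28) for n = 3, and the cofactors of its first row
R = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
R11 = 1 - r23**2
R12 = -(r12 - r13 * r23)
R13 = r12 * r23 - r13

b12 = -(s1 / s2) * R12 / R11
b13 = -(s1 / s3) * R13 / R11

# the normal equations (29) and (30), divided through by N
eq29 = b12 * s2**2 + b13 * r23 * s2 * s3 - r12 * s1 * s2
eq30 = b12 * r23 * s2 * s3 + b13 * s3**2 - r13 * s1 * s3

sigma_123 = s1 * (R / R11) ** 0.5  # the scatter about the regression plane
```

Both residuals vanish, and $\sigma_{1.23}$ is smaller than $\sigma_1$, as the least-squares fit requires.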
Since the mean value of these estimates is zero when the origin is at the mean of each system of variates, we have for the square of the standard deviation $\sigma_{1E}$ of the estimates of $x_1$ given by (27)
$$\sigma_{1E}^2 = \frac{1}{N}\sum \left(\text{computed } x_1\right)^2 = \frac{\sigma_1^2}{R_{11}^2}\left(R_{12}^2 + R_{13}^2 + 2R_{12}R_{13}r_{23}\right) = \sigma_1^2\left(1 - \frac{R}{R_{11}}\right) .$$
The correlation coefficient $r_{1.23\ldots n}$ between the observed values of $x_1$ and its corresponding estimated values calculated from the linear function (27) of $x_2, x_3, \ldots, x_n$ is called the multiple correlation coefficient of order $n-1$ of $x_1$ with the other $n-1$ variables. The multiple correlation coefficient $r_{1.23\ldots n}$ is expressible in terms of simple correlation coefficients by the formula
$$(34)\qquad r_{1.23\ldots n} = \left[1 - R/R_{11}\right]^{1/2} .$$
To prove (34), limiting ourselves to $n = 3$, we write
$$N\sigma_1\sigma_{1E}\, r_{1.23} = \sum x_1\left(-\sigma_1\right)\left(\frac{R_{12}}{R_{11}}\frac{x_2}{\sigma_2} + \frac{R_{13}}{R_{11}}\frac{x_3}{\sigma_3}\right) = -\frac{N\sigma_1^2}{R_{11}}\left(R_{12}r_{12} + R_{13}r_{13}\right) = -\frac{N\sigma_1^2}{R_{11}}\left(R - R_{11}\right) = N\sigma_1^2\left(1 - R/R_{11}\right) .$$
Since $\sigma_{1E} = \sigma_1\left[1 - R/R_{11}\right]^{1/2}$, we have the result sought.

The relation (34) is very significant because it enables us to express multiple correlation coefficients in terms of simple correlation coefficients. From equations (32) and (34), it follows that
$$(35)\qquad \sigma_{1.23\ldots n}^2 = \sigma_1^2\left(1 - r_{1.23\ldots n}^2\right) .$$

34. Partial correlation. It is often important to obtain the degree of correlation between two variables $x_1$ and $x_2$ when the other variables $x_3, x_4, \ldots, x_n$ have assigned values. For example, we might find the correlation of statures of fathers and sons when the stature of the mother is an assigned constant, say 62 inches. In general, suppose we have found a correlation between characters $A$ and $B$, and that it is a plausible interpretation that the correlation thus found is due to the correlation of each of them with a character $C$. In this case we could remove the influence of $C$, if we had a sufficient amount of data, by restricting our data to a universe of $A$ and $B$ corresponding to an assigned $C$. In accord with this notion, we may define a partial correlation coefficient $r'_{12.34\ldots n}$
of $x_1$ and $x_2$ for assigned $x_3, x_4, \ldots, x_n$ as the correlation coefficient of $x_1$ and $x_2$ in the part of the population for which $x_3, x_4, \ldots, x_n$ have assigned values. A change in the selection of assigned values may lead to the same or to different values of $r'_{12.34\ldots n}$.

Suppose we are dealing with a population for which the regression curves are straight lines and the regression surfaces are planes. Thus, let us assume that the theoretical mean or expected values of $x_1$ and $x_2$ for an assigned $x_3, x_4, \ldots, x_n$ are
$$b_{13}x_3 + b_{14}x_4 + \cdots + b_{1n}x_n , \qquad b_{23}x_3 + b_{24}x_4 + \cdots + b_{2n}x_n ,$$
respectively. Then a partial correlation coefficient $r'_{12.34\ldots n}$ is the simple correlation coefficient of the residuals
$$x_{1.34\ldots n} = x_1 - b_{13}x_3 - b_{14}x_4 - \cdots - b_{1n}x_n$$
and
$$x_{2.34\ldots n} = x_2 - b_{23}x_3 - b_{24}x_4 - \cdots - b_{2n}x_n$$
limited to the part of the population $N_{34\ldots n}$ of the total $N$ for which $x_3, x_4, \ldots, x_n$ are fixed. Suppose further that the population is such that any change in the assignment of values to $x_3, x_4, \ldots, x_n$ changes neither the standard deviation of $x_{1.34\ldots n}$, nor that of $x_{2.34\ldots n}$, nor the value of $r'_{12.34\ldots n}$. Such a population suggests that we define
$$(36)\qquad r_{12.34\ldots n} = \frac{\sum x_{1.34\ldots n}\; x_{2.34\ldots n}}{N\, \sigma_{1.34\ldots n}\, \sigma_{2.34\ldots n}} ,$$
where the summation is extended to $N$ pairs of residuals, as the partial correlation coefficient of $x_1$ and $x_2$ for all sets of assignments of $x_3, x_4, \ldots, x_n$. If the population is such that $r'_{12.34\ldots n}$ is not the same for each different set of assignments of $x_3, x_4, \ldots, x_n$, the right-hand member of (36) may still be regarded as a sort of average value of the correlation coefficients of $x_1$ and $x_2$ in subdivisions of a population obtained by assigning $x_3, x_4, \ldots, x_n$; or it may be regarded as the correlation coefficient between the deviations of $x_1$ and $x_2$ from the corresponding predicted values given by their linear regression equations on $x_3, x_4, \ldots, x_n$.
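A numerical sketch (a modern illustration; the correlation values are made up) of the partial correlation coefficient for $n = 3$, using its expression in simple correlation coefficients, checked against the relation with multiple correlation coefficients derived in the sequel:

```python
# Partial correlation r_{12.3} for n = 3 in terms of simple correlations,
# together with a check of the relation
#   1 - r_{12.3}^2 = (1 - r_{1.23}^2) / (1 - r_{13}^2).
# The correlation values below are made up.

def partial_corr_3(r12, r13, r23):
    # equivalent to -R12 / sqrt(R11 R22) with the cofactors written out
    return (r12 - r13 * r23) / ((1 - r13**2) * (1 - r23**2)) ** 0.5

r12, r13, r23 = 0.6, 0.5, 0.4
p = partial_corr_3(r12, r13, r23)

# square of the multiple correlation coefficient r_{1.23}, as in (34)
R = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # determinant (28)
r123_sq = 1 - R / (1 - r23**2)

lhs = 1 - p**2
rhs = (1 - r123_sq) / (1 - r13**2)
```

The two sides agree identically, which is the $n = 3$ case of the partial-multiple relation.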
The partial correlation coefficient as given in (36) is expressible in terms of simple correlation coefficients by the formula
$$(37)\qquad r_{12.34\ldots n} = \frac{-R_{12}}{\left[R_{11}R_{22}\right]^{1/2}} ,$$
where $R_{pq}$ is a cofactor defined in §33. We may prove (37), limiting ourselves to $n = 3$, as follows. By definition,
$$r_{12.3} = \frac{\sum x_{1.3}\, x_{2.3}}{N\sigma_{1.3}\,\sigma_{2.3}} = \frac{\sum \left(x_1 - r_{13}\dfrac{\sigma_1}{\sigma_3}x_3\right)\left(x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}x_3\right)}{\left[\sum \left(x_1 - r_{13}\dfrac{\sigma_1}{\sigma_3}x_3\right)^2\, \sum \left(x_2 - r_{23}\dfrac{\sigma_2}{\sigma_3}x_3\right)^2\right]^{1/2}} = \frac{r_{12} - r_{13}r_{23}}{\left[(1 - r_{13}^2)(1 - r_{23}^2)\right]^{1/2}} = \frac{-R_{12}}{\left[R_{11}R_{22}\right]^{1/2}} .$$
Thus (37) is proved for $n = 3$.

An important relation between partial and multiple correlation coefficients may now be derived. From (37) we have
$$1 - r_{12.34\ldots n}^2 = \frac{R_{11}R_{22} - R_{12}^2}{R_{11}R_{22}} .$$
By a well-known theorem on determinants,
$$\begin{vmatrix} R_{11} & R_{12} \\ R_{12} & R_{22} \end{vmatrix} = R_{11}R_{22} - R_{12}^2 = R\,R_{11,22} ,$$
where $R_{11,22}$ is the minor of $R$ obtained by deleting its first and second rows and first and second columns. Hence we have
$$1 - r_{12.34\ldots n}^2 = \frac{R\,R_{11,22}}{R_{11}R_{22}} = \frac{R/R_{11}}{R_{22}/R_{11,22}} = \frac{1 - r_{1.23\ldots n}^2}{1 - r_{1.34\ldots n}^2} ,$$
since, from (32) and (35),
$$\frac{R}{R_{11}} = 1 - r_{1.23\ldots n}^2 , \qquad \text{and similarly} \qquad \frac{R_{22}}{R_{11,22}} = 1 - r_{1.34\ldots n}^2 .$$
Thus we can express the partial correlation coefficient $r_{12.34\ldots n}$ of order $n-2$ (the number of variables held constant) in terms of the multiple correlation coefficient $r_{1.23\ldots n}$ of order $n-1$ and the multiple correlation coefficient $r_{1.34\ldots n}$ of order $n-2$.

35. Non-linear regression in $n$ variables; multiple correlation ratio. The theory of correlation for non-linear regression lends itself to extension to the case of more than two variables, as has been demonstrated by the contributions of L. Isserlis and Karl Pearson.

Consider the variables $x_1, x_2, \ldots, x_n$, and fix attention on an array of $x_1$'s which corresponds to assigned values of $x_2, x_3, \ldots, x_n$. Next, let $\bar x_{1.23\ldots n}$ be the mean of the values in the array of $x_1$'s, and let $\sigma_{\bar x_{1.23\ldots n}}$ be the standard deviation of these means of arrays of $x_1$'s, where the square of each deviation $\bar x_{1.23\ldots n} - \bar x_1$ from the mean of the $x_1$'s is weighted with the number in the array in finding this standard deviation.
Then the multiple correlation ratio $\eta_{1.23\ldots n}$ of $x_1$ on $x_2, x_3, \ldots, x_n$ may be defined by
$$(38)\qquad \eta_{1.23\ldots n}^2 = \frac{\sigma_{\bar x_{1.23\ldots n}}^2}{\sigma_1^2} .$$
The analogy with the case of the correlation ratio for two variables seems fairly obvious. While the method of computing the multiple correlation ratio $\eta_{1.23\ldots n}$ is simple in principle, it is unfortunately laborious from the arithmetic standpoint.

36. Remarks on the place of probability in the regression method. Thus far we have discussed simple correlation by the regression method without using probabilities in explicit form. To be sure, probability theory is involved in the background. It seems fairly obvious that it would be of fundamental interest to construct urn schemata which would give a meaning to the correlation and regression coefficients in pure chance. In a paper published by the author in 1920, certain urn schemata were devised which give linear regression and very simple values for the correlation coefficient. Other schemata, apparently equally simple, give non-linear regression. The general plan of the schemata consists in requiring certain elements to be common in successive random drawings. It appears that the construction of such urn schemata will tend to give correlation a place in the elementary theory of probability.

In a recent book by the Russian mathematician A. A. Tschuprow, an important step has been taken toward connecting the regression method of dealing with correlation more closely with the theory of probability. This is accomplished by a consideration of the underlying definitions and concepts for a priori distributions.

It may be noted that we have not based our development of the regression method on a precise definition of correlation. Instead we have attempted a sort of genetic development.
It may at this point be helpful, in forming a proper notion of the scope and limitations of the regression method, to give a definition of correlation from the regression viewpoint. It seems that a general definition will involve probabilities, because we shall almost surely wish to idealize actual distributions into theoretical distributions or laws of frequency for purposes of definition. In a general sense, we may say that $y$ is correlated with $x$ whenever the theoretical distributions in arrays of $y$'s are not identical for all possible assigned values of $x$, and we say that $y$ is uncorrelated with $x$ whenever the theoretical distributions in arrays of $y$'s are identical with each other for all possible values of $x$. By the identity of the theoretical distributions in arrays of $y$'s, we mean that they have equal means, standard deviations, and the other parameters required to characterize the distributions completely. It is fairly obvious that our discussion of the regression method is incomplete in a sense, because we have not given a complete characterization of distributions in arrays. Our characterization of the statistical dependence of $y$ on $x$ may be regarded as complete when the arrays of $y$'s are normal distributions, because the distributions are then completely characterized by their arithmetic means and standard deviations.

THE CORRELATION SURFACE METHOD OF DESCRIPTION

37. The normal correlation surfaces. The function
$$z = f(x_1, x_2, \ldots, x_n)$$
is called a frequency function of the $n$ variables $x_1, x_2, \ldots, x_n$ if
$$z\, dx_1\, dx_2 \cdots dx_n$$
gives, to within infinitesimals of higher order, the probability that a set of values of $x_1, x_2, \ldots, x_n$ taken at random will lie in the infinitesimal region bounded by $x_1$ and $x_1 + dx_1$, $x_2$ and $x_2 + dx_2$, $\ldots$, $x_n$ and $x_n + dx_n$. When the variables are not independent in the probability sense, the surface represented by $z = f(x_1, x_2, \ldots, x_n)$ is called a correlation surface.
With the notation of §29 for simple correlation, the natural extension of the theory underlying the normal frequency function of one variable to functions of two variables $x$ and $y$ leads to the correlation surface
$$z = \frac{1}{2\pi\sigma_x\sigma_y(1-r^2)^{1/2}}\; e^{-\frac{1}{2(1-r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)} .$$
Moreover, with the notation of §33 on multiple correlation, the natural extension to the case of a function of $n$ normally correlated variables $x_1, x_2, \ldots, x_n$ gives a frequency function of the exponential type
$$z = \frac{1}{(2\pi)^{n/2}\,\sigma_1\sigma_2\cdots\sigma_n\, R^{1/2}}\; e^{-\phi/2} ,$$
where
$$\phi = \frac{1}{R}\sum_{p}\sum_{q} R_{pq}\,\frac{x_p x_q}{\sigma_p\sigma_q}$$
is a homogeneous quadratic function of the $n$ variables, the determinant $R$ and its cofactors $R_{pp}$ and $R_{pq}$ being defined in §33. We thus have a correlation surface in space of $n+1$ dimensions.

For purposes of simplicity we shall limit our derivations of normal frequency functions to functions of two and three variables, thus restricting the geometry involved to space of three and four dimensions. The equation of the normal frequency surface may be derived from various sets of assumptions analogous to, and extensions of, sets of assumptions from which the normal frequency curve may be derived. Some of these derivations make no explicit use of the fact that in normal correlation the regression is linear; that is, linear regression is considered as a property of the frequency surface obtained from other assumptions. But we may connect the frequency-surface method closely with the regression method by involving linear regression of one of the variables on the others as one of the assumptions from which to derive the surface. This is the plan we shall adopt in the following derivation. Let us assume, first, that one set of variates, say the $x$'s, are distributed normally about their mean value taken as an origin. Then in our notation (p. 47 and §29)
$$(39)\qquad \frac{1}{\sigma_x\sqrt{2\pi}}\; e^{-\frac{x^2}{2\sigma_x^2}}\, dx ,$$
to within infinitesimals of higher order, is the probability that an $x$ taken at random will lie in the interval $dx$.
Assume next that any array of $y$'s corresponding to an assigned $x$ is a normal distribution with the standard deviation of an array given by $\sigma_y(1-r^2)^{1/2}$, as found earlier in this chapter (§31), and finally assume that the regression of $y$ on $x$ is linear. Then, in the notation of simple correlation,
$$(40)\qquad \frac{1}{\sigma_y(1-r^2)^{1/2}\sqrt{2\pi}}\; e^{-\frac{\left(y - r\frac{\sigma_y}{\sigma_x}x\right)^2}{2\sigma_y^2(1-r^2)}}\, dy$$
is, to within infinitesimals of higher order, the probability that a $y$ taken at random from an assigned array of $y$'s will lie in the interval $dy$. By using the elementary principle that the probability that both of two events will occur is equal to the product of the probability that the first will occur and the probability that the second will occur when the first is known to have occurred, we have the product $z\, dx\, dy$ of (39) and (40) for the probability, to within infinitesimals of higher order, that $x$ will fall in $dx$ and the corresponding $y$ in $dy$, where
$$(41)\qquad z = \frac{1}{2\pi\sigma_x\sigma_y(1-r^2)^{1/2}}\; e^{-\frac{1}{2(1-r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)}$$
is the normal correlation surface in three dimensions.

Let us turn next to the derivation of the normal correlation surface in four dimensions. Following the notation of multiple correlation, we seek a normal frequency function
$$z = f(x_1, x_2, x_3) .$$
We shall assume first that pairs of the variates, say of $x_2$'s and $x_3$'s, are normally distributed. Then, by what has just been demonstrated about the form of the correlation surface in three dimensions, the expression
$$(42)\qquad \frac{1}{2\pi\sigma_2\sigma_3(1-r_{23}^2)^{1/2}}\; e^{-\frac{1}{2(1-r_{23}^2)}\left(\frac{x_2^2}{\sigma_2^2} + \frac{x_3^2}{\sigma_3^2} - \frac{2r_{23}x_2x_3}{\sigma_2\sigma_3}\right)}\, dx_2\, dx_3$$
is, to within infinitesimals of higher order, the probability that a point $(x_2, x_3)$ taken at random lies within the area $dx_2\, dx_3$. We next assume that the regression of $x_1$ on $x_2$ and $x_3$ is linear, and that each array of $x_1$'s corresponding to an assigned $(x_2, x_3)$ is a normal distribution with standard deviation
$$\sigma_{1.23} = \sigma_1\left(R/R_{11}\right)^{1/2}$$
given by (32).
Then, in the notation of multiple correlation, the probability that a variate taken at random in an assigned $(x_2, x_3)$-array of $x_1$'s will lie in $dx_1$ is given, to within infinitesimals of higher order, by
$$(43)\qquad \frac{R_{11}^{1/2}}{\sigma_1(2\pi R)^{1/2}}\; e^{-\frac{R_{11}}{2R}\left(\frac{x_1}{\sigma_1} + \frac{R_{12}}{R_{11}}\frac{x_2}{\sigma_2} + \frac{R_{13}}{R_{11}}\frac{x_3}{\sigma_3}\right)^2}\, dx_1 .$$
Then the probability that a point $(x_1, x_2, x_3)$ taken at random will lie in the volume $dx_1\, dx_2\, dx_3$ is given, to within infinitesimals of higher order, by the product of (42) and (43). This gives, after some simplification, for the probability in question $z\, dx_1\, dx_2\, dx_3$, where
$$(44)\qquad z = \frac{1}{(2\pi)^{3/2}\, R^{1/2}\, \sigma_1\sigma_2\sigma_3}\; e^{-\phi/2}$$
and
$$\phi = \frac{1}{R}\left(R_{11}\frac{x_1^2}{\sigma_1^2} + R_{22}\frac{x_2^2}{\sigma_2^2} + R_{33}\frac{x_3^2}{\sigma_3^2} + 2R_{12}\frac{x_1x_2}{\sigma_1\sigma_2} + 2R_{13}\frac{x_1x_3}{\sigma_1\sigma_3} + 2R_{23}\frac{x_2x_3}{\sigma_2\sigma_3}\right) .$$

38. Certain properties of normally correlated distributions. The equal-frequency curves obtained by making $z$ take constant values in equation (41) are an infinite system of homothetic ellipses, any one of which has an equation of the form
$$\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y} = \lambda^2 .$$
The area of the ellipse is
$$\frac{\pi\lambda^2\sigma_x\sigma_y}{(1-r^2)^{1/2}} ,$$
and the semiaxes are given by $a = k\lambda$ and $b = k'\lambda$, where $k$ and $k'$ are functions of $\sigma_x$, $\sigma_y$, and $r$. The probability that a point $(x, y)$ taken at random will fall within any ellipse obtained by assigning $\lambda$ is given by
$$(45)\qquad 1 - e^{-\frac{\lambda^2}{2(1-r^2)}} .$$
Attention has often been called to the equal-frequency ellipse known as the "probable" ellipse. The probable ellipse may be defined as that ellipse of the system such that the probability is 1/2 that a point $(x, y)$ of the scatter-diagram (see Fig. 18, p. 109) lies within it. This means, by (45), that
$$1 - e^{-\frac{\lambda^2}{2(1-r^2)}} = \tfrac{1}{2} , \qquad \lambda^2 = 1.3863\,(1-r^2) .$$
From (45) it follows that
$$\frac{\lambda}{1-r^2}\; e^{-\frac{\lambda^2}{2(1-r^2)}}\, \Delta\lambda$$
gives, to within infinitesimals of higher order, the probability that a point $(x, y)$ taken at random will fall in a small ring obtained by taking values of $\lambda$ in $\Delta\lambda$. We may determine the ellipse along which, for a given small ring $\Delta\lambda$, we should expect more points $(x, y)$ than along any other ellipse of the system. For a constant $\Delta\lambda$, the probability is a maximum when $\lambda^2 = 1 - r^2$.
Hence, what may be called the ellipse of maximum probability is
$$\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y} = 1 - r^2 .$$
To illustrate the meaning of this ellipse, we may say that in Bertrand's illustration of shooting a thousand shots at a target, the probability is greater that a shot will strike along this ellipse than along any other ellipse of the system. It is an interesting fact that the ellipse of maximum probability is identical with the orthogonal projection of the parabolic points of the correlation surface on the plane of distribution. To prove this theorem, we simply find the locus of parabolic points on the surface (41) by means of the well-known condition
$$\frac{\partial^2 z}{\partial x^2}\,\frac{\partial^2 z}{\partial y^2} - \left(\frac{\partial^2 z}{\partial x\,\partial y}\right)^2 = 0 .$$
This gives
$$\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y} = 1 - r^2 ,$$
which establishes the theorem. By comparing $\lambda^2 = 1 - r^2$ with $\lambda^2 = 1.3863\,(1-r^2)$, we note that the probable ellipse is larger than the ellipse of maximum probability.

For the statures of 1,078 husbands and wives, the two ellipses just discussed are shown on the scatter-diagram in Figure 18. By actual count from the drawing (Fig. 18), it turns out that 536 of the 1,078 points are within the probable ellipse and 412 are within the ellipse of maximum probability. These numbers differ from the theoretical values by amounts well within what should be expected as chance fluctuations.

Another interesting problem in connection with the correlation surface relates to the determination of the locus along which the frequency or density of points on the plane of distribution (scatter-diagram) bears a simple relation to the corresponding density under independence. Thus, we seek the curve along which dots of the scatter-diagram are $k$ times as frequent as they would be under independence, where $k$ is a constant. Equating $z$ in (41) to $k$ times the corresponding value of $z$ when $r = 0$ in (41), we obtain after slight simplification the hyperbola (Fig. 18)
$$(46)\qquad r\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) - \frac{2xy}{\sigma_x\sigma_y} = \frac{2(1-r^2)}{r}\,\log\frac{1}{k\,(1-r^2)^{1/2}} .$$
Karl Pearson dealt with this curve for $k = 1$.
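The theoretical counts against which the Figure 18 tallies are judged can be checked from (45) alone (a modern illustration, not part of the original text): the probable ellipse should contain half the points, and the ellipse of maximum probability the fraction $1 - e^{-1/2}$.

```python
# Theoretical proportions inside the probable ellipse and the ellipse of
# maximum probability, from (45): P(inside lambda) = 1 - exp(-lambda^2 / (2(1-r^2))).
import math

def prob_inside(lam_sq, r):
    return 1 - math.exp(-lam_sq / (2 * (1 - r**2)))

r = 0.5  # any r serves: the two proportions below do not depend on r
p_probable = prob_inside(2 * math.log(2) * (1 - r**2), r)  # lambda^2 = 1.3863(1-r^2)
p_maximum = prob_inside(1 - r**2, r)                       # lambda^2 = 1 - r^2

expected_probable = 1078 * p_probable  # ~539, against the count of 536
expected_maximum = 1078 * p_maximum    # ~424, against the count of 412
```

Both observed counts fall within a few points of the theoretical values, consistent with the text's remark about chance fluctuations.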
That is, he considered the locus along which the density of points of the scatter-diagram is the same as it would have been under independence. The fact that the density of distribution at the centroid in (41) is $1/(1-r^2)^{1/2}$ times as much as it would be under independence naturally suggests the study of the locus of all points for which $k = 1/(1-r^2)^{1/2}$ in (46). It turns out that in this case the hyperbola degenerates into a pair of straight lines through the origin. These lines are shown as lines $AB$ and $CD$ in Figure 18. They separate the plane of distribution into four compartments such that one-fourth is the probability that a pair of values $(x, y)$ taken at random will give a point falling into any prescribed one of these compartments.

Although no further discussion of the properties of normal correlation surfaces will be attempted in this monograph, certain properties analogous to those mentioned for the surface in three dimensions would follow rather readily in the case of the surfaces in higher dimensions. Thus the system of ellipsoids of equal frequencies has been studied to some extent. In a paper by James McMahon, the connection between the geometry of the hypersphere and the theory of normal frequency functions of $n$ variables is established by linearly transforming the hyperellipsoids of equal frequency into a family of hyperspherical surfaces, and by applying the formulas of hyperspherical goniometry to obtain theorems in multiple and partial correlation.

39. Remarks on further methods of characterizing correlation. In bringing to a conclusion our discussion of correlation, it may be of interest to point out a few of the limitations and omissions in our treatment, and to give certain references that would facilitate further reading.
We have not even touched on the methods of dealing with correlation of characters which do not seem to admit of exact measurement, but which admit of classification; for example, eye color, hair color, and temperament may be regarded as such characters. Such characters are sometimes called qualitative characters, to distinguish them from quantitative characters. The correlation between two such characters has been dealt with in some cases by the method of tetrachoric correlation, in other cases by the method of contingency, and by the method of correlation in ranks in cases where the items are ordered but not measured. We have not touched on the methods of dealing with correlation in time series, a subject of much importance in the methodology of economic statistics. The methods and theories of connection and concordance of Gini for dealing with correlation have been omitted. No discussion has been given of the fundamental work of Bachelier on correlation theory in his treatment of continuous probabilities of two or more variables. Our discussion of frequency surfaces in §37 is limited to normal correlation surfaces. The way is, however, fairly clear for the extension of the Gram-Charlier system of representation to distributions of two or more variables which are not normally distributed.
While great difficulties have been encountered in the past thirty years in attempts to pass naturally from the Pearson system of generalized frequency curves to analogous surfaces for the characterization of the distribution of two correlated variables, it is of considerable interest to remark that substantial progress has been made recently on the solution of this problem by Narumi, Pearson, and Camp.

Although the many omissions make it fairly obvious that our discussion is not at all complete, it is hoped that enough has been said about the theory of correlation to indicate that this theory may properly be considered as constituting an extensive branch in the methodology of science that should be further improved and extended.

CHAPTER V

RANDOM SAMPLING FLUCTUATIONS

40. Introduction. In Chapter II we have dealt to some extent with the effects of random sampling fluctuations on relative frequencies. But it is fairly obvious that the interest of the statistician in the effects of sampling fluctuations extends far beyond the fluctuations in relative frequencies. To illustrate, suppose we calculate any statistical measure, such as an arithmetic mean, median, standard deviation, correlation coefficient, or parameter of a frequency function, from the actual frequencies given by a sample of data. If we need then either to form a judgment as to the stability of such results from sample to sample, or to use the results in drawing inferences about the sampled population, the common-sense process of induction involved is much aided by a knowledge of the general order of magnitude of the sampling discrepancies which may reasonably be expected because of the limited size of the sample from which we have calculated our statistical measures.

We may very easily illustrate the nature of the more common problems of sampling by considering the determination of certain characteristics of a race of men.
For example, suppose we wish to describe any character such as height, weight, or other measurable attributes among the white males age 30 in the race. We should almost surely attempt merely to construct our science on the basis of results obtained from the sample. Then the question arises: What is an adequate sample for a particular purpose? The theory of sampling throws some light on this question. The development of the elements of a theory of sampling fluctuations in various averages, coefficients, and parameters is thus of fundamental importance in regarding the results obtained from a sample as approximate representatives of the results that would be obtained if the whole indefinitely large population were taken.

One of the difficult and practical questions involved in making statistical inquiries by sample relates to the invention of satisfactory devices for obtaining a random sample at the source of material. A result obtained from a sample, unless taken with great care, may diverge significantly from the true value characteristic of the sampled population. For example, the writer had an experience in attempting to pick up a thousand ears of Indian corn at random with respect to size of ears. It soon appeared fairly obvious that instinctively one tended to make "runs" on ears of approximately the same size. The sample would probably not be taken at random when thus drawn. Such systematic divergence from conditions necessary for obtaining a random sample is assumed to be eliminated before the results that follow from the theory of random sampling fluctuations are applicable. In the practical applications of sampling theory, it is thus important to remember that the conditions for random sampling at the source of data are not always easily fulfilled. In fact, it seems important in certain investigations to devise special schemes for obtaining a random sample.
For example, we may sometimes improve the conditions for drawing a random sample of individuals by the use of a ball or card bearing the number of each individual of a much larger aggregate than the sample we propose to measure, and by then drawing the sample by lot from such a collection of balls or cards after they have been thoroughly mixed. Even with urn schemata containing white and black balls thoroughly mixed, it must be assumed further that one kind of ball is not more slippery than another, if slippery balls evade being drawn. The appropriate devices for obtaining a random sample depend almost entirely on the nature of the particular field of inquiry, and we shall in the following discussion simply assume that random samples can be drawn.

In an inquiry by sample, the following fundamental question comes up very naturally about any result, say a mean value $\bar x$, to be obtained from a sample of $s$ individuals: What is the probability that $\bar x$ will deviate numerically not more than an assigned positive number $\delta$ from the corresponding unknown true value that would be given by using an unlimited supply of the material from which the $s$ variates are drawn? This question presents difficulties. An ideal answer is not available, but valuable estimates of the probability called for in this question may be made under certain conditions by a procedure which involves finding the standard deviation of random sampling deviations.

For the unknown true value referred to above, continental European writers very generally use the mathematical expectation or the expected value of the variable (cf. §6). In what follows, we shall to some extent adopt this practice, and shall find it convenient to assume the following propositions without taking the space to demonstrate them:

I. The expected value $E[x - E(x)]$ of deviations of a variable from its expected value $E(x)$ is zero.

II.
The expected value of the sum of two variables is the sum of their expected values. That is, $E(x+y) = E(x) + E(y)$.

III. The expected value of the product of a constant and a variable is equal to the product of the constant by the expected value of the variable. That is, $E(cx) = cE(x)$.

IV. The expected value of the product $xy$ of corresponding values of two mutually independent variables $x$ and $y$ is equal to the product of their expected values, where we call $x$ and $y$ mutually independent if the law of distribution of each of them remains the same whatever values are assigned to the other.

V. In particular, if $x$ and $y$ are corresponding deviations of two mutually independent variables from their expected values, the expected value of the product $xy$ is zero. It is fairly obvious that V follows from I and IV.

It is convenient in the discussion of random sampling fluctuations to deal with the problem of the distribution of results from samples of equal size. To give a simple example, let us conceive of taking a random sample consisting of 1,000 men of a well-defined race in which some character is measured, giving us 1,000 variates. Next, suppose we repeat the process until we have 1,000 such samples of 1,000 men in each sample. Then each of the samples would have its own arithmetic mean, median, mode, standard deviation, moments, and so on. Consider next the 1,000 results of a given kind, say the 1,000 arithmetic means from the samples. They would almost surely differ but slightly from one another in comparison with differences between extreme individual variates. But if the measurements are reasonably accurate, the means would differ and form a frequency distribution. This frequency distribution of means would have its own mean (mean of means) and its own standard deviation.
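The repeated-sampling scheme just described can be imitated in a small simulation (a modern illustration, not part of the original text; the population mean and standard deviation of statures are made up, and the scale is reduced from 1,000 by 1,000 to keep the run quick):

```python
# Simulation of repeated sampling: many samples of equal size, each
# yielding a mean; the means form their own distribution, whose standard
# deviation (the "standard error") is close to sigma / sqrt(s).
import random

random.seed(1)
s = 400          # individuals per sample
n_samples = 400  # number of samples drawn

means = []
for _ in range(n_samples):
    sample = [random.gauss(68.0, 3.0) for _ in range(s)]  # e.g. statures
    means.append(sum(sample) / s)

m = sum(means) / n_samples
sd_of_means = (sum((x - m) ** 2 for x in means) / n_samples) ** 0.5
theoretical = 3.0 / s ** 0.5  # sigma / sqrt(s) = 0.15
```

The simulated dispersion of the means comes out close to $\sigma/\sqrt{s}$, anticipating the standard-error formulas derived in the following articles.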
We are especially interested in such a standard deviation, for it may be taken as an approximate measure of the variability or dispersion of means obtained from different samples. This standard deviation (standard error) would no doubt be a fairly satisfactory measure of sampling fluctuations for certain purposes.

Although the process of finding mean values from each of a large number of equal samples with a large number of individuals in each sample gives us a useful conception of the problem of sampling errors in mean values, it would ordinarily be a laborious and usually an impractical task, because of paucity of available data, to carry out such a set of calculations. The statistician ordinarily obtains a result from a sample by calculation, say a mean value x̄, and then investigates the standard deviation of such results without taking further samples. That such a treatment of the problem is possible is clearly an important mathematical achievement.

The space available in the present monograph will not permit the derivation of formulas for the standard deviation of sampling errors in many types of averages or parameters. In fact, we shall limit ourselves to presenting only sufficient derivations of such formulas to indicate the nature of the main assumptions and approximations involved in the rationale which supports such formulas, and certain of their interpretations. Preliminary to deriving formulas for standard deviations of sampling errors in certain averages and parameters, we need to find the standard deviation and correlation of errors in class frequencies of any given frequency distribution. For brevity we shall use the expression "standard error" in place of "standard deviation of errors."

41. Standard error and correlation of errors in class frequencies. Suppose we obtain from a random sample of a population an observed frequency distribution

f_1, f_2, . . . . , f_t, . . . .
, f_n with a number f_t of individuals in a class t, and with a total of

f_1 + f_2 + . . . . + f_n = s

individuals observed in the sample. Suppose next that we should obtain a large number of such samples of s observations each taken under the same essential conditions. A class frequency f_t will vary from sample to sample. These values of f_t will form a frequency distribution. We set the problem of expressing the expected value of the square of the standard deviation σ_{f_t} in terms of observed values.

To solve this problem, we may consider that any observation to be made is a trial, and that it is a success to obtain an observation for which the individual falls in the class t. Let p_t be the probability of success in one trial, and q_t = 1 − p_t be the corresponding probability of failure. In sets of s trials with a constant probability p_t of obtaining an individual in the class t, we have from page 27 that the square of the standard deviation of f_t in the theoretical distribution is given by

(1)  σ_{f_t}^2 = s p_t q_t = s p_t (1 − p_t) .

In statistical applications, we do not ordinarily know the exact value of p_t, but accept the relative frequency f_t/s as an approximation to p_t if s is large. If we thus accept f_t/s as an approximation to p_t, and substitute p_t = f_t/s in (1), we obtain

(2)  σ_{f_t}^2 = f_t (1 − f_t/s)

as an approximate value of the square of σ_{f_t}, conveniently expressed in terms of observed frequencies.

The value (2) is regarded as an appropriate approximation to the value of (1) because (1) may be obtained from (2) by replacing the quotient f_t/s by its expected value p_t. It is usually agreed among statisticians, however, that a better approximation to (1) would be an expression which as a whole has the second member of (1) as its expected value. The expected value of the product f_t(1 − f_t/s) is not the product s p_t(1 − p_t) of the expected values of its factors, as we shall see in the next paragraph.
It will be found that the second member of the equation

(3)  σ_{f_t}^2 = (s/(s − 1)) f_t (1 − f_t/s)

has s p_t(1 − p_t) as its expected value, and (3) is therefore regarded as a better approximation than (2) for expressing (1) in terms of observed frequencies. The reason for the advantage of formula (3) over formula (2) is the subject of frequent inquiries by students of statistics, and it is hoped that the discussion here given will contribute to answering such inquiries.

In accordance with the principle just stated, it will be seen that the error introduced by replacing s p_t(1 − p_t) by f_t(1 − f_t/s) involves not only sampling errors, but also a certain systematic error. Thus, although the expected value of f_t is s p_t (p. 26) and the expected value of 1 − f_t/s is 1 − p_t, we shall see as stated above that the expected value of the product f_t(1 − f_t/s) is not equal to the product s p_t(1 − p_t) of the expected values, but is in fact equal to (s − 1) p_t(1 − p_t). We may prove this by first expressing (1), with the help of the definition of σ_{f_t}, in the form

(4)  E[(f_t − s p_t)^2] = s p_t (1 − p_t) ,

and then applying the last proposition on page 21, which states that the expected value of the square of the variable x is equal to the square of the expected value of x increased by the expected value of the square of the deviations of x from its expected value. Thus, for a variable x = f_t with an expected value s p_t, we write

E(f_t^2) = s^2 p_t^2 + E[(f_t − s p_t)^2] = s^2 p_t^2 + s p_t (1 − p_t)

from (4). Further,

(5)  E[f_t (1 − f_t/s)] = E(f_t) − (1/s) E(f_t^2)
        = s p_t − s p_t^2 − p_t (1 − p_t)
        = (s − 1) p_t (1 − p_t) .
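The bias exhibited in (5) can be verified exactly, without simulation, by enumerating the binomial distribution of f_t term by term. The values s = 10 and p = 0.3 below are arbitrary illustrative choices.

```python
from math import comb

# Exact check of the systematic error discussed in the text: the naive
# estimate f_t(1 - f_t/s) of s*p_t*(1 - p_t) has expected value
# (s - 1)*p_t*(1 - p_t), so multiplying by s/(s - 1), as in formula (3),
# removes the systematic error.
s, p = 10, 0.3

def binom_pmf(f):
    """Probability that the class frequency equals f in s trials."""
    return comb(s, f) * p**f * (1 - p)**(s - f)

# E[f(1 - f/s)] computed term by term over f = 0, 1, ..., s.
expected_naive = sum(binom_pmf(f) * f * (1 - f / s) for f in range(s + 1))

target = s * p * (1 - p)              # the quantity (1) we wish to estimate
print(expected_naive)                 # equals (s-1)*p*(1-p), i.e. about 1.89
print(expected_naive * s / (s - 1))   # corrected as in (3): about 2.1
```

The enumeration confirms that f_t(1 − f_t/s) underestimates s p_t(1 − p_t) by exactly the factor (s − 1)/s.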
By multiplying both members of (5) by s/(s − 1), we may write

s p_t (1 − p_t) = E[ (s/(s − 1)) f_t (1 − f_t/s) ] .

Thus, in approximating to the value s p_t(1 − p_t) in the right member of (1) by means of a function of the observed f_t, we note that the function s f_t(1 − f_t/s)/(s − 1) has the expected value s p_t(1 − p_t) which we seek, and that f_t(1 − f_t/s), given in the right member of (2) as an approximation to s p_t(1 − p_t), contains a systematic error.

In finding standard errors in means, moments, correlation coefficients, and so on, it is important to know the correlation between deviations of frequencies in any two classes. Let δf_t be the deviation of f_t from the theoretical mean or expected value of the class frequency in taking a random sample of s variates. Then since f_1 + f_2 + . . . . + f_t + . . . . + f_n = s, a constant, we have

(6)  δf_1 + δf_2 + . . . . + δf_t + . . . . + δf_n = 0 .

If our sample has given δf_t more than the expected number in the class t, it may reasonably be assumed that a deficiency equal to −δf_t will tend to be distributed among the other groups in proportion to their expected relative frequencies. Now suppose we had a correlation table made up of pairs of values of δf_t and δf_{t'} obtained from a large number of samples. Consider the array in which δf_t has a fixed value. By (6), for each sample,

−δf_t = δf_1 + δf_2 + . . . . + δf_{t−1} + δf_{t+1} + . . . . + δf_n .

Assume that the amount of frequency in the left member of this equality is distributed to terms of the right member in such proportion that, for a fixed δf_t, the mean value of δf_{t'} is

(7)  −(p_{t'}/(1 − p_t)) δf_t .

This gives the mean of the array under consideration.

It is fairly obvious that the correlation coefficient r of N pairs of deviations x and y from mean values is such that

r σ_x σ_y = mean value of xy = (1/N) Σ N_x x ȳ_x ,

where ȳ_x is the mean of the x-array of y's, and N_x is the number in the array.
By attaching this meaning to the correlation coefficient r_{f_t f_{t'}} of δf_t and δf_{t'}, and using (7) for the mean of the array, we have

r_{f_t f_{t'}} σ_{f_t} σ_{f_{t'}} = mean value of δf_t δf_{t'}
   = −(p_{t'}/(1 − p_t)) (mean value of δf_t^2) = −(p_{t'}/(1 − p_t)) σ_{f_t}^2

(8)   = −s p_t p_{t'}  from (1)

(9)   = −f_t f_{t'}/s  as a first approximation.

A systematic error is involved in replacing s p_t p_{t'} by f_t f_{t'}/s on account of the correlation between f_t and f_{t'}. To deal with the effect of this correlation, we may first write (3), page 83, in the form

(1/N) Σ_{i=1}^{N} x_i y_i = x̄ ȳ + r σ_x σ_y .

If we are dealing with a population or theoretical distribution rather than with a sample, this formula gives us the proposition that the expected value of the product, x_i y_i, of pairs of variables is equal to the product, x̄ ȳ, of their expected values increased by the product, r σ_x σ_y, of the correlation coefficient and the two standard deviations. To apply this proposition when x_i = f_t and y_i = f_{t'}, we note from (8) that, for the population, r σ_x σ_y = −s p_t p_{t'}, and recall that E(f_t) = s p_t and E(f_{t'}) = s p_{t'}. Then the proposition stated above gives us

E(f_t f_{t'}) = s^2 p_t p_{t'} − s p_t p_{t'} ,

(10)  E(f_t f_{t'}/s) = (1/s) E(f_t f_{t'}) = (s − 1) p_t p_{t'} .

To obtain the right member of (8) as accurately as possible in terms of the observed f_t and f_{t'}, we multiply both members of (10) by s/(s − 1) and then note that f_t f_{t'}/(s − 1) has the expected value s p_t p_{t'}. In the right member of (9), the value f_t f_{t'}/s used as an approximation to s p_t p_{t'} thus contains a certain systematic error. To eliminate the systematic error from (9), we write

(11)  r_{f_t f_{t'}} σ_{f_t} σ_{f_{t'}} = −f_t f_{t'}/(s − 1)

in place of (9) as a second approximation to (8).

42. Remarks on the assumptions involved in the derivation of standard errors.
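The key result (8), that the mean value of δf_t δf_{t'} over all samples is −s p_t p_{t'}, can be checked exactly by enumerating all possible pairs of class frequencies in s trials. The values s = 12 and the class probabilities below are arbitrary illustrative choices.

```python
from math import comb

# Exact enumeration check of equation (8): for two class frequencies
# f_t, f_t' arising from s trials with class probabilities p1, p2 (the
# remaining probability belonging to all other classes), the mean value
# of the product of their deviations from expectation is -s*p1*p2.
s, p1, p2 = 12, 0.2, 0.5   # remaining probability 0.3 for the other classes

cov = 0.0
for f1 in range(s + 1):
    for f2 in range(s + 1 - f1):
        prob = (comb(s, f1) * comb(s - f1, f2)
                * p1**f1 * p2**f2 * (1 - p1 - p2)**(s - f1 - f2))
        cov += prob * (f1 - s * p1) * (f2 - s * p2)

print(cov)  # equals -s*p1*p2 = -1.2 up to rounding
```

The negative sign reflects the apportionment assumption above: an excess in one class is compensated by deficiencies distributed over the others.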
The three outstanding assumptions that should probably be emphasized in considering the validity and the limitations of the results (2) and (9) are (a) that the probability that a variate taken at random will fall into any assigned class remains constant, (b) that the number s is so large that we obtain certain valuable approximations by using the relative frequency f_t/s in place of the probability p_t that a variate taken at random will fall into the class t, and (c) that any sampling deviation δf_t from the expected value of a class frequency is accompanied by an apportionment of −δf_t to other class frequencies in amounts proportional to the expected values of such other class frequencies.

Our use of assumption (b) involves more than is apparent on the surface, because in its use we not only replace a single isolated probability p_t by a corresponding relative frequency f_t/s, but we further assume the liberty of using certain functions of the relative frequencies in place of these functions of the corresponding probabilities or expected values. This procedure may lead to certain systematic errors in addition to the sampling errors. For example, we have, in obtaining (2), used the function f_t(1 − f_t/s) of f_t/s in place of the same function s p_t(1 − p_t) of the expected value p_t, and have by this procedure tended to underestimate the expected value when s is finite. That is, s f_t(1 − f_t/s)/(s − 1), and not f_t(1 − f_t/s), is our best estimate of the expected value. However, when s becomes large, f_t(1 − f_t/s) is a valuable first approximation to the expected value.

The rule that the expected value of a function may be taken as approximately equal to the function of the expected value has been much used by statisticians in a rather loose and uncritical manner. A critical study of the application and limitations of this rule was published by Bohlmann in 1913.
While it is beyond the scope of this monograph to enter upon a general discussion of Bohlmann's conclusions, it is of special interest for our purpose that the application of the rule leads at least to first approximations when the functions in question are algebraic functions. Although it may seem that we have in the derivation of (2) and (9) taken the liberty to substitute relative frequencies rather freely in place of the probabilities required in an exact theory, this procedure may be extended to any algebraic functions when the number s is very large, with the expectation of obtaining useful approximations. Since certain derivations which follow make use of (2) and (9), the resulting formulas involve the weaknesses and limitations of the above assumptions.

43. Standard error in the arithmetic mean and in a qth moment coefficient about a fixed point. For the arithmetic mean x̄ of s observed values of a variable x we write

x̄ = (1/s) Σ_{t=1}^{n} f_t x_t ,

where f_t is the class frequency of x_t. Suppose the s values constitute a random sample of observations on the variable x. Suppose further that we continue taking observations on x until we have a very large number of random samples each consisting of s observed values. Then assume that there exists an expected value of each f_t about which the observed f_t's exhibit dispersion, and that corresponding to these expected values there exists a theoretical mean value x̄′ of x about which the x̄'s calculated from samples of s exhibit dispersion. Using δf and δx̄ to denote deviations in any sample from the expected values of f and x̄, respectively, we write

s δx̄ = Σ x_t δf_t ,

s^2 (δx̄)^2 = Σ (x_t^2 δf_t^2) + 2 Σ′ (x_t x_{t'} δf_t δf_{t'}) ,

where the sum Σ extends from t = 1 to t = n, and Σ′ is the sum for all values of t and t' for which t ≠ t'. Next, sum both members of this equality for all samples and divide by the number of samples. This gives, in the notation for standard deviations (p. 119) and for the correlation coefficient (p.
123),

s^2 σ_{x̄}^2 = Σ (x_t^2 σ_{f_t}^2) + 2 Σ′ (x_t x_{t'} σ_{f_t} σ_{f_{t'}} r_{f_t f_{t'}}) .

By using (1) and (8), we have

s σ_{x̄}^2 = Σ (x_t^2 p_t) − Σ (x_t^2 p_t^2) − 2 Σ′ (x_t x_{t'} p_t p_{t'})
   = Σ (x_t^2 p_t) − (Σ x_t p_t)^2 = μ_2′ − x̄′^2 = σ^2 ,

where σ is the standard deviation of the theoretical distribution. Then

(12)  σ_{x̄} = σ/s^{1/2} .

Instead of the σ of the theoretical distribution, we ordinarily use the σ obtained from a sample. To introduce the expected value of σ^2 from the sample, we may, for a first approximation, use (2) and (9) in place of (1) and (8) above, and obtain very simply a form identical with (12). As a second approximation, we may use (3) and (11) in place of (1) and (8) above, and obtain very simply

(13)  σ_{x̄}^2 = σ^2/(s − 1)  and  σ_{x̄} = σ/(s − 1)^{1/2} ,

where σ is to be obtained from the sample.

The distinction between the expected value of σ^2 from the population and from the sample involves a rather delicate point, but one that has been long recognized in the literature of error theory. The distinction has been rather generally ignored in books on statistics. In numerical problems, the differences in the results of formulas (12) and (13) are negligible when s is large.

The standard deviation (standard error) may well serve as a measure of sampling fluctuations. But custom has not established the direct use of the standard error to any considerable extent. The so-called probable error has come into much more common use than the standard error. The probable error E is sometimes defined very simply as .6745 times the standard error, without regard to the nature of the distribution. This definition of the probable error does not impose the condition that the distribution of results obtained on repetition shall necessarily be a normal distribution.
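Formula (12) can be checked exactly for a simple discrete population by direct enumeration. In this sketch the population is a 0/1 variable with P(1) = p, an arbitrary illustrative choice, so that σ² = p(1 − p) and the mean of s draws is f/s with f binomially distributed.

```python
from math import comb

# Exact check of formula (12), sigma_xbar = sigma / s^(1/2): for a 0/1
# population, the variance of the mean of s independent draws, computed
# by summing over the binomial distribution of f, should equal
# p(1 - p)/s exactly.
p, s = 0.35, 20
sigma2 = p * (1 - p)

var_mean = sum(
    comb(s, f) * p**f * (1 - p)**(s - f) * (f / s - p) ** 2
    for f in range(s + 1)
)
print(var_mean, sigma2 / s)  # both equal p(1-p)/s = 0.011375
```

For this population the agreement is exact, which is a stronger statement than (12) needs; for a general population (12) rests on the approximations (1) and (8) discussed above.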
But with such a definition of probable error, the real difficulty is not overcome, but merely shifted to the point where we attempt an interpretation of the probable error in terms of the odds in favor of or against an observed result obtained from a sample falling within an assigned deviation of the true value.

Thus, in the derivation of (12) we have obtained, subject to certain important limitations, the standard deviation of means x̄ obtained from samples of s about a theoretical mean value x̄′ which may ordinarily be regarded as a sort of true value of the mean. If the distribution of x̄'s obtained from samples about such a true value is assumed to be a normal distribution, we may by the use of the table of the probability integral state at once that the odds are even that an x̄ obtained from a sample will differ numerically from the true value by not more than

E = .6745 (standard error) .

It is the assumption of a normal distribution of the means from samples, combined with the specification of an even wager, that brings the multiplier .6745 into the problem.

We may further expedite the treatment of sampling errors by finding the odds in favor of or against an observed deviation from the true value not exceeding numerically a certain multiple of E, say tE. As t increases to 5, 6, or more, the odds in favor of obtaining a deviation smaller than tE are so large as to make it practically certain that we will obtain such a smaller deviation.

We have discussed briefly the meaning and limitations of probable errors. The most outstanding limitation on the interpretation of probable errors is the requirement of a normal distribution of the statistical constant under consideration. We have to a considerable extent used the arithmetic mean as an illustration, but the same general requirements about the normality of the distribution would clearly apply, whatever the statistical constant.

We shall consider next the standard error in a qth
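The role of the multiplier .6745 and of the multiples tE can be sketched with the probability integral, here evaluated through the error function of the Python standard library (a modern computational stand-in for the tables the text refers to).

```python
from math import erf, sqrt

# For a normal distribution, the probability that a sampling deviation is
# numerically smaller than lam standard errors is erf(lam / sqrt(2)).
def prob_within(lam):
    """Probability a normal deviate falls within lam standard deviations."""
    return erf(lam / sqrt(2))

E = 0.6745  # the probable error, in units of the standard error

print(prob_within(E))          # about 0.50: even odds, as the text states
for t in (2, 3, 5):            # odds for deviations within t*E
    print(t, prob_within(t * E))
```

At t = 5 the probability is already above .999, illustrating the remark that such a deviation is practically certain to be exceeded only rarely.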
moment coefficient μ_q′ about a fixed point. By definition,

μ_q′ = (1/s) Σ f_t x_t^q .

For the relation between deviations from theoretical values we have

s δμ_q′ = Σ x_t^q δf_t .

Then

s^2 (δμ_q′)^2 = Σ (x_t^{2q} δf_t^2) + 2 Σ′ (x_t^q x_{t'}^q δf_t δf_{t'}) .

Sum both members of this equality for a large number of samples N and divide by N. This gives, in the notation for standard deviations (p. 119) and for the correlation coefficient (p. 123),

s^2 σ_{μ_q′}^2 = Σ (x_t^{2q} σ_{f_t}^2) + 2 Σ′ (x_t^q x_{t'}^q σ_{f_t} σ_{f_{t'}} r_{f_t f_{t'}}) .

Using (1) and (8), we have

s σ_{μ_q′}^2 = Σ (x_t^{2q} p_t) − Σ (x_t^{2q} p_t^2) − 2 Σ′ (x_t^q x_{t'}^q p_t p_{t'})
   = Σ (x_t^{2q} p_t) − (Σ x_t^q p_t)^2 = μ_{2q}′ − μ_q′^2 .

Then

(14)  σ_{μ_q′} = [ (μ_{2q}′ − μ_q′^2)/s ]^{1/2} ,

where the moments in the right-hand member relate to the theoretical distribution. By methods analogous to those used in the case of the arithmetic mean (pp. 127-28), we may pass to moments which relate to the sample. The probable error of μ_q′ is then E = .6745 σ_{μ_q′}, and the usual interpretation of such a probable error by means of odds in favor of or against deviations less than a multiple of E is again dependent on the assumption that the qth moments μ_q′ found from repeated trials form a normal distribution.

44. Standard error of the qth moment μ_q about a mean. In considering the problem of the standard error of a moment about a mean, it is important to recognize the difference between the mean of the population and a mean obtained from a sample. For simplicity, we shall consider the problem of the standard error in a qth moment about the mean of the population when we take samples of s variates as in § 43. The mean of the population is a fixed point about which we take the qth moment of each sample of s variates. Then, if we follow the usual plan of dropping the primes from the μ's to denote moments about a mean, we write from (14)

σ_{μ_q}^2 = (μ_{2q} − μ_q^2)/s

for the square of the standard error of μ_q in terms of moments of the theoretical distribution. In particular, we have for the standard error of the second moment

σ_{μ_2}^2 = (μ_4 − μ_2^2)/s .

When the distribution is normal, μ_4 = 3μ_2^2, and σ_{μ_2}^2 = 2μ_2^2/s. Since σ = (μ_2)^{1/2}, we have

δσ = δμ_2/(2σ) , nearly.
Square each member, sum for all samples, and divide by the number of samples. This gives

σ_σ^2 = σ_{μ_2}^2/(4σ^2) = 2σ^4/(4sσ^2) = σ^2/(2s) ,

or

σ_σ = σ/(2s)^{1/2} .

Hence, the probable error in approximating to the standard deviation σ of the population by the standard deviation from a sample of s variates is given approximately by

.6745 σ_σ = .6745 σ/(2s)^{1/2} .

To avoid misunderstanding, it should perhaps be emphasized that we have throughout this section restricted our discussion to the qth moment about the mean of the population. The problem of dealing with the standard error in the qth moment about the mean of a sample offers additional difficulties because such a mean varies from sample to sample. A problem arises from the correlation of errors in the means and in the corresponding moments. Further problems arise in considering the closeness of certain approximations, especially when the moments are of fairly high order, that is, when q is large. We shall simply state without demonstration that the square of the standard error in the qth moment about the mean of a sample is given by

(μ_{2q} − μ_q^2 − 2q μ_{q−1} μ_{q+1} + q^2 μ_2 μ_{q−1}^2)/s

as a first approximation. For q = 2, this expression becomes (μ_4 − μ_2^2)/s. For q = 4, it becomes (μ_8 − μ_4^2)/s in the case of a normal distribution. These expressions for the special cases q = 2 and q = 4 are the same as for the moments about a fixed point.

45. Remarks on the standard errors of various statistical constants. We have shown a method of derivation of the standard errors in certain statistical constants (the mean, the qth moment about a fixed point), and in particular the derivation of the probable error of the mean. Our main purpose has been to indicate briefly the nature of the assumptions involved in the derivation of the most common probable-error formulas. The next step would very naturally consist in finding the correlations of errors in two moments.
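The result σ_σ = σ/(2s)^{1/2} for normal samples can be sketched by simulation. The population standard deviation, the sample size, and the number of samples below are arbitrary illustrative choices, and the sample standard deviation is computed with divisor s, in keeping with the convention of the text.

```python
import random
import statistics

# Simulation sketch: the standard deviation, over many normal samples of
# size S, of each sample's standard deviation should be close to
# SIGMA / (2*S)^(1/2) when S is fairly large.
random.seed(7)
SIGMA, S, TRIALS = 1.0, 200, 3000

sample_sds = [
    statistics.pstdev([random.gauss(0.0, SIGMA) for _ in range(S)])
    for _ in range(TRIALS)
]
observed = statistics.pstdev(sample_sds)
predicted = SIGMA / (2 * S) ** 0.5    # 1/sqrt(400) = 0.05
print(observed, predicted)
```

The agreement is close for s = 200; for small samples the formula degrades, a point taken up in § 48.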
Following this, we could deal with the general problem of standard errors in parameters of frequency functions of one variable on the assumption that the parameters may be expressed in terms of moment coefficients. Thus, let

y = f(x, c_1, c_2, . . . . )

be any frequency curve, where any parameter

c_i = φ(x̄, μ_2, μ_3, . . . . , μ_q, . . . . )

is a function of the mean and of moments about the mean. Suppose that this relation is such that we may express δc_i in terms of δx̄, δμ_2, δμ_3, . . . . , at least approximately, by differentiation of the function φ. If we then square δc_i, sum, and divide by the number of samples, we obtain an approximation to the square of the standard error in c_i. While, in a general way, this method may be described as a straightforward procedure, the derivation of useful formulas is likely to involve rather laborious algebraic details. Moreover, considerable difficulty may arise in estimating the errors involved in the approximate results.

The difficulties of estimating the magnitude of the errors involved are likely to be much increased when the statistical constant, for example, a correlation coefficient, is a function not merely of the moments of the separate variables, but also of the product moments of two variables.

In concluding these remarks on standard errors of statistical parameters obtained from moments of observations, it may be of interest to point out that the characterization of the sampling fluctuations in such parameters may be extended and refined by the use of higher-order moments of the errors in the parameters. B. H. Camp has shown that the use of moments of order higher than two may very naturally be accompanied by the use of a certain number of terms of Gram-Charlier series as a distribution function.

46. Standard error of the median.
Thus far in our discussion of standard errors and probable errors, we have assumed that the statistical constants or characteristics of the frequency function are given as functions of the moments. There are, however, useful characteristics such as a median, a quartile, a decile, and a percentile of a distribution which are not ordinarily given as functions of moments. Such a characteristic number used in the description of a distribution is ordinarily calculated from its definition, which specifies that its value is such that a certain fractional part of the total frequency is on either side of the value in question. For example, a median m of a given distribution is ordinarily calculated from the definition that variates above and below m are to be equally frequent. Similarly, a fourth decile D_4 is calculated from the definition that four-tenths of the frequency is to be below D_4. We are thus concerned with the sampling fluctuations of the bounds of the interval which includes an assigned proportion of the frequency.

To illustrate further, let us consider the standard error in the median m of samples of N of a variable x distributed in accord with a continuous law of frequency given by y = f(x). We assume that there exists a certain ideal median value M of the population of which we have a sample of N, and that by definition of the median 1/2 is then the probability that a variate taken at random falls above (or below) M. We may then write that, in any sample of N variates taken at random from the indefinitely large set, the number above M is N/2 + d. That is, the median m of the sample is at a distance δm from M. When y has a value corresponding to a value of x in the interval δm, we may write

y δm = d

to within infinitesimals of higher order. Such an equation connects the change δm in the median of the sample from the theoretical M with the sampling deviation d of the frequency above M.
Then

δm = d/y  and  σ_m = σ_d/y .

But, from (1), page 119,

σ_d^2 = Npq = N/4 .

Hence

σ_m = N^{1/2}/(2y) .

If we have a normal distribution, the value of y at the median is given by

y = N/(σ (2π)^{1/2}) = .39894 N/σ ,

and the standard error in the median found from ranks is

(15)  σ_m = 1.2533 σ/N^{1/2} .

Although the theoretical values of the median and of the arithmetic mean are equal in a normal distribution, the median found from a sample by ranking has a sampling error 1.2533 times as large as that of the arithmetic mean obtained as a first moment from the same sample.

47. Standard deviation of the sum of independent variables. In sampling problems, it is often found useful to know the expected value of the square of the standard deviation of the sum Y = X_1 + X_2 + . . . . + X_s of s mutually independent variables when we have given the standard deviations σ_1, σ_2, . . . . , σ_s of each variable in the population to which it belongs.

Assuming that the given deviations are measured from the theoretical or expected values for the populations, we consider deviations x_i = X_i − E(X_i), and write the deviation of the sum

y = x_1 + x_2 + . . . . + x_s .

Square both sides, sum for the number of samples N, and divide by N. If we pass to expected values, and let σ_1^2, σ_2^2, . . . . , σ_s^2 denote the squares of standard deviations of the several variables and σ_y^2 that of their sum in the populations, we have

(16)  σ_y^2 = σ_1^2 + σ_2^2 + . . . . + σ_s^2 ,

the product terms vanishing by V, page 117.

It is a matter of some interest to note how the expected value just found differs from the expected value of the sum of squares of the s deviations of x_1, x_2, . . . . , x_s from their mean

(1/s) Σ_{i=1}^{s} x_i

obtained from a sample. If we let

(17)  x_i′ = x_i − (1/s) Σ_{i=1}^{s} x_i ,

we are to find E(x_1′^2 + x_2′^2 + . . . . + x_s′^2) in terms of E(x_i^2) = σ_i^2 (i = 1, 2, . . . . , s). From (17) we may write

x_1′ = ((s − 1)/s) x_1 − x_2/s − . . . . − x_s/s ,
x_2′ = −x_1/s + ((s − 1)/s) x_2 − . . . . − x_s/s ,
. . . . . . . . . . . . . . . . . . . .

Then, squaring and adding the terms with i ≠ j separately, we have

x_1′^2 + . . . . + x_s′^2 = ((s − 1)/s)(x_1^2 + . . . . + x_s^2) − (2/s) Σ_{i<j} x_i x_j .
Hence, passing to expected values, using V, page 117,

(18)  E(x_1′^2 + . . . . + x_s′^2) = ((s − 1)/s)(σ_1^2 + . . . . + σ_s^2) .

48. Remarks on recent progress with sampling errors of certain averages obtained from small samples. In the development of the theory of sampling, the assumption has usually been made that the sample contains a large number of individuals, thus leading to the expectation that the replacement of probabilities by corresponding relative frequencies will give a valuable approximation. But the lower bound of large numbers has remained poorly defined in this connection. For example, certain probable-error formulas have been applied to as few as ten observations.

Beginning with a paper by Student in 1908, there have been important experimental and theoretical results obtained on the distribution of arithmetic means, standard deviations, and correlation coefficients obtained from small samples.

In 1915, Karl Pearson took an important step in advance by obtaining the curve

(19)  y = y_0 x^{n−2} e^{−nx^2/(2σ^2)}

for the distribution of the standard deviations of samples of n variates from an infinite population distributed in accord with the normal curve. By finding the moments μ_2, μ_3, and μ_4 of this theoretical distribution, then tabulating the corresponding β_1, β_2, and the skewness of the curve (19) for integral values of n from 4 to 100, and making use of the fact that β_1 = 0, β_2 = 3, and sk (skewness) = 0 are necessary conditions for a normal distribution, Pearson shows experimentally that the distribution of standard deviations given by (19) approaches practically a normal distribution as n increases. In this experiment, the necessary conditions β_1 = 0, β_2 = 3, and sk = 0 are assumed to be sufficient for practical approach to a normal distribution.
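Formula (18) can be sketched by simulation with unequal standard deviations, which is the general case it covers. The particular σ_i values, the normal populations, and the seed are arbitrary illustrative choices.

```python
import random
import statistics

# Simulation sketch of (18): with mutually independent variables having
# standard deviations sigma_1, ..., sigma_s, the expected sum of squared
# deviations measured from the SAMPLE mean is (s-1)/s times the sum of
# the sigma_i^2 -- smaller than the sum itself, since the sample mean
# chases the observations.
random.seed(3)
sigmas = [1.0, 2.0, 3.0, 4.0]                 # s = 4 variables
s = len(sigmas)
sum_sigma2 = sum(sg * sg for sg in sigmas)    # 30

TRIALS = 200_000
total = 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0.0, sg) for sg in sigmas]
    m = statistics.fmean(xs)
    total += sum((x - m) ** 2 for x in xs)

print(total / TRIALS)  # near (s-1)/s * 30 = 22.5
```

The shortfall factor (s − 1)/s here is the same one met in formula (13) for the standard error of the mean computed from a sample.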
From this table of values, Pearson concludes that for samples of 50 the usual theory of probable error of the standard deviation holds satisfactorily, and that to apply it to samples of 25 would not lead to any error of importance in the majority of statistical problems. On the other hand, if a small sample, say n < 20, of a population be taken, the value of the standard deviation found from the sample tends to be less than the standard deviation of the population.

In a paper published in 1915, R. A. Fisher dealt with the frequency distribution of the correlation coefficient r derived from samples of n pairs each taken at random from an infinite population distributed in accord with the normal correlation surface (p. 104), where ρ is the correlation coefficient. The frequency function y = f_n(r) given by Fisher for the distribution of r was such that the investigation of its approach to a normal curve as n increases seemed to require special methods for computing the ordinates and moments. Such special methods were given in a joint memoir by H. E. Soper, A. W. Young, B. M. Cave, A. Lee, and Karl Pearson. The values of β_1 and β_2 were computed for these distributions to study the approach to the normal curve.

With respect to the approach of these distributions to the normal form with increasing values of n, it is found that the necessary conditions β_1 = 0, β_2 = 3 for a normal distribution are not well fulfilled for samples of 25 or even 50, whatever the value of ρ. For samples of 100, the approach to the conditions β_1 = 0, β_2 = 3 is fair for low values of ρ, but for large values of ρ, say ρ > .5, there is considerable deviation of β_1 from 0, and of β_2 from 3. For samples of 400, on the whole, the approach to the necessary conditions β_1 = 0, β_2 = 3 is close, but there is quite a sensible deviation from normality when ρ ≧ .8.
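The non-normality of the distribution of r for small n and high ρ can be sketched by simulation. The values n = 25, ρ = 0.8, the number of samples, and the seed are arbitrary illustrative choices.

```python
import random
import statistics

# Simulation sketch of the warning in the text: for small samples and a
# high population correlation rho, the distribution of the sample
# correlation coefficient r is markedly skew, so normal-theory probable
# errors mislead.
random.seed(11)
RHO, N, TRIALS = 0.8, 25, 4000

def sample_r():
    """Draw N pairs from a bivariate normal with correlation RHO; return r."""
    xs, ys = [], []
    for _ in range(N):
        x = random.gauss(0.0, 1.0)
        y = RHO * x + (1 - RHO**2) ** 0.5 * random.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (N * sx * sy)

rs = [sample_r() for _ in range(TRIALS)]
m = statistics.fmean(rs)
sd = statistics.pstdev(rs)
skew = statistics.fmean([((r - m) / sd) ** 3 for r in rs])
print(m, skew)  # mean near rho, but skewness clearly negative, not 0
```

The distinctly negative skewness is the deviation of β_1 from 0 that the text reports for large ρ at these sample sizes.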
These results give us a striking warning of the dangers in interpreting the ordinary formula for the probable error of r when we have small samples. As to the limitations on the generality of these results, it should be remembered that the assumption is made, in this theory of the distribution of r from small samples, that we have drawn samples from an infinite population well described by a normal correlation surface, so that the conclusions are not in the strictest sense applicable to distributions not normally distributed. While the results just now described have thrown much light on the distributions of statistical constants calculated from small samples, it is fairly obvious that much remains to be done on this important problem.

49. The recent generalizations of the Bienaymé-Tchebycheff criterion. Although the use of probable errors for judging of the general order of magnitude of the numerical values of sampling deviations is a great aid to common-sense judgment, it must surely be granted that we are much hampered in drawing certain inferences depending on probable errors, because of the limitation that the interpretation of the probable error of a statistical constant is to some extent dependent in any particular case on the normality of the distribution of such constants obtained from samples, and because of the lack of knowledge as to the nature of the distribution.

Any theory that would deal effectively with the problem of finding a criterion for judging of the magnitude of sampling errors with little or no limitation on the nature of the distribution would be a most welcome contribution, especially if the theory could be made of value in dealing with actual statistical data. The Bienaymé-Tchebycheff criterion (p. 29) may be regarded as an important step in the direction of developing such a theory.
We have in the Tchebycheff inequality a theorem specifying an upper bound 1/λ^2 for the probability that a datum taken at random will be equal to or greater than λ times the standard deviation, without limitation on the nature of the distribution. That is, if P(λσ) is the probability that a datum drawn at random from the entire distribution will differ in absolute value from the mean of all values by as much as λσ, then

(20)  P(λσ) ≦ 1/λ^2 .

To establish a first generalization of this inequality (cf. p. 29), let us consider a variable x which takes mutually exclusive values x_1, x_2, . . . . , x_n with corresponding probabilities p_1, p_2, . . . . , p_n, where p_1 + p_2 + . . . . + p_n = 1. Let a be any number from which we wish to measure deviations. For the expected value of the moment of order 2s about a, we may write

μ_{2s}′ = p_1 d_1^{2s} + p_2 d_2^{2s} + . . . . + p_n d_n^{2s} ,

where d_i = x_i − a. Let d′, d″, . . . . , be those deviations x_i − a which are numerically as large as an assigned multiple λσ (λ > 1) of the root-mean-square deviation σ, and let p′, p″, . . . . , be the corresponding probabilities. Then we have

μ_{2s}′ ≧ p′ d′^{2s} + p″ d″^{2s} + . . . . .

Since d′, d″, . . . . , are each numerically as large as λσ, we have

μ_{2s}′ ≧ λ^{2s} σ^{2s} (p′ + p″ + . . . . ) .

If we let P(λσ) be the probability that a value of x taken at random will differ from a numerically by as much as λσ, then P(λσ) = p′ + p″ + . . . . , and

P(λσ) ≦ μ_{2s}′/(λ^{2s} σ^{2s}) ,

and the probability of obtaining a deviation numerically less than λσ is greater than

1 − μ_{2s}′/(λ^{2s} σ^{2s}) .

This generalization of the Tchebycheff inequality is due to Karl Pearson, except that he assumed a distribution given by a continuous function, with a as the mean x-coordinate of the centroid of frequency area. For this case, we should merely drop the prime from μ_{2s}′, and write

(21)  P(λσ) ≦ μ_{2s}/(λ^{2s} σ^{2s}) .

With s = 1, we obviously have the Tchebycheff inequality as a special case.
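The bounds (20) and (21) can be sketched numerically against a decidedly skew population. The exponential population, the value λ = 4, and the use of sample moments as stand-ins for the theoretical ones are arbitrary illustrative choices.

```python
import random
import statistics

# Numerical sketch of (20) and (21): compare the actual frequency of
# deviations of at least lam standard deviations with the Tchebycheff
# bound 1/lam^2 (the case s = 1) and Pearson's fourth-moment bound
# mu_4/(lam^4 sigma^4) (the case s = 2), for an exponential population.
random.seed(5)
data = [random.expovariate(1.0) for _ in range(100_000)]

m = statistics.fmean(data)
sigma = statistics.pstdev(data)
lam = 4.0

actual = sum(abs(x - m) >= lam * sigma for x in data) / len(data)
tcheb = 1 / lam**2                                   # (20), s = 1
mu4 = statistics.fmean([(x - m) ** 4 for x in data])
pearson = mu4 / (lam**4 * sigma**4)                  # (21), s = 2

print(actual, tcheb, pearson)
```

At λ = 4 the fourth-moment bound is the closer of the two, yet both remain far above the actual frequency, illustrating Pearson's complaint, quoted below, that these inequalities are seldom close enough to equality to be of much practical assistance.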
It is Pearson's view that, although his inequality is in most cases a closer inequality than that of Tchebycheff, it is usually not close enough to an equality to be of practical assistance in drawing conclusions from statistical data. On the whole, Pearson expresses not only disappointment at the results of the Tchebycheff inequality, but holds that his own generalization still lacks, in general, the degree of approximation which would make the result of real value in important statistical applications. Hence, it is an important problem to obtain closer inequalities.

The problem of closer inequalities has been dealt with in recent papers by several mathematicians. Camp, Guldberg, Meidel, and Narumi have succeeded particularly well by placing certain mild restrictions on the nature of the distribution function $F(x)$. The restrictions are of such a nature as to leave the distribution function sufficiently general to be useful in the actual problems of statistics. The main restriction placed on $F(x)$ by Camp is that it is to be a monotonic decreasing function of $|x|$ when $|x| \ge c\sigma$, $c \ge 0$. The general effect of this restriction is to exclude distributions which are not represented by decreasing functions of $|x|$ at points more than a certain assigned distance from the origin.

We shall now present the main results of Camp without proof. With the origin so chosen that zero is at the mean, he reaches the generalized inequality

(22) $P(\lambda\sigma) \le \dfrac{\beta_{2s-2}}{\left(1+\frac{1}{2s}\right)^{2s}\lambda_1^{2s}}$,

where $\beta_{2s-2} = \mu_{2s}/\sigma^{2s}$, and where $\lambda_1$ is $\lambda$ diminished by a correction term proportional to $c$, which vanishes when $c=0$. When $c=0$, the formula (22) is Pearson's formula (21) divided by $(1+1/2s)^{2s}$.

The general effect of the work of Camp and Meidel has been to decrease the upper bound of the Pearson inequality (21) by roughly 50 per cent. These generalizations seem to have both theoretical and practical value when we have regard for the fact that the results apply to almost any type of distribution that occurs in practical applications.
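For the special case $s=1$, $c=0$, the Camp-Meidel refinement divides the Tchebycheff bound $1/\lambda^2$ by $(1+1/2)^2 = 9/4$, giving $4/(9\lambda^2)$ for distributions satisfying Camp's monotonicity restriction. A small numerical sketch (the normal population is an arbitrary choice satisfying the restriction):

```python
import random

# Camp-Meidel refinement at s = 1, c = 0: for a density decreasing
# monotonically in |x|, the Tchebycheff bound 1/lam^2 may be divided
# by (1 + 1/2)^2 = 9/4, i.e. P(lam * sigma) <= 4 / (9 * lam^2).
random.seed(3)
data = [random.gauss(0.0, 1.0) for _ in range(200_000)]
n = len(data)
mean = sum(data) / n
sigma = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

for lam in (2.0, 3.0):
    tail = sum(abs(x - mean) >= lam * sigma for x in data) / n
    tcheby = 1 / lam**2
    camp_meidel = 4 / (9 * lam**2)
    assert tail <= camp_meidel < tcheby
    print(lam, tail, camp_meidel, tcheby)
```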
Indeed, it is so satisfying to have only very mild restrictions on the nature of the distribution in judging sampling errors that further progress in extending the cautious limits of sampling fluctuations given by the generalizations of the Tchebycheff inequality would be of fundamental value.

50. Remarks on the sampling fluctuations of an observed frequency distribution from the underlying theoretical distribution. If we have fitted a theoretical frequency curve to an observed distribution, or if we know the theoretical frequencies from a priori considerations, the question often arises as to the closeness of fit of theory and observation. In considering this question, a criterion is needed to assist common-sense judgment in testing whether the theoretical curve or distribution fits the observed distribution well or not. It is beyond the scope of the present monograph to deal with the theory underlying such a criterion, but it seems desirable to remark that the fundamental paper on this important problem of random sampling was contributed by Karl Pearson under the title, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, Volume 50, Series 5 (1900), pages 157-75.

Closely related to the problem of the closeness of fit of theory and observation is the fundamental problem of establishing a criterion for measuring the probability that two independent distributions of frequency are really random samples of the same population. Pearson published one solution of this problem in Biometrika, Volume 8 (1911-12), pages 250-54. The resulting criterion represents an important achievement of mathematical statistics as an aid to common-sense judgment in considering the circumstances surrounding the origin of a random sample of data.

CHAPTER VI

THE LEXIS THEORY

51. Introduction.
We have throughout Chapter II assumed a constant probability underlying the frequency ratios obtained from observation. It is fairly obvious that frequency ratios are often found from material in which the underlying probability is not constant. Then the statistician should make use of all available knowledge of the material for appropriate classification into subsets for analysis and comparison. It thus becomes important to consider a set of observations which may be broken into subsets for examination and comparison as to whether the underlying probability seems to be constant from subset to subset. In the separation of a large number of relative frequencies into $n$ subsets according to some appropriate principle of classification, it is useful to make the classification so that the theory of Lexis may be applied.

In the theory of Lexis we consider three types of series or distributions characterized by the following properties:

1. The underlying probability $p$ may remain a constant throughout the whole field of observation. Such a series is called a Bernoulli series, and has been considered in Chapter II.

2. Suppose next that the probability of an event varies from trial to trial within a set of $s$ trials, but that the several probabilities for one set of $s$ trials are identical to those of every other of $n$ sets of $s$ trials. Then the series is called a Poisson series.

3. When the probability of an event is constant from trial to trial within a set but varies from set to set, the series is called a Lexis series.

The theory of Lexis uses these three types as norms for comparison of the dispersions of series which arise in practical problems of statistics. An estimate of the importance of this theory may probably be formed from the facts that Charlier states in his Vorlesungen über mathematische Statistik (1920) that it is the first essential step forward in mathematical statistics since the days of Laplace, and that J. M.
Keynes expressed a somewhat similar opinion in his Treatise on Probability (1920). These may be somewhat extreme views when we recall the contributions of Poisson, Gauss, Bravais, and Tchebycheff, but they at least throw light on the outstanding character of the contribution of Lexis to the theory of dispersion. The characteristic feature of the method of Lexis is that it encourages the analysis of the material by breaking up the whole series into a set of sub-series for examination of the fluctuation of the frequency among the various sub-series. Such a plan of analysis surely has the sanction of common-sense judgment.

In drawing $s$ balls one at a time with replacements from an urn of such constitution that $p$ is the constant probability that a ball to be drawn will be white, we have already established the following results for Bernoulli series:

1. The mathematical expectation of the number of white balls is $sp$ (p. 26).

2. The standard deviation of the theoretical distribution of frequencies is $(spq)^{1/2}$ (p. 27).

3. The standard deviation of the corresponding distribution of relative frequencies is $(pq/s)^{1/2}$ (p. 27).

52. Poisson series. To develop the theory of the Poisson series, let $s$ urns, $U_1, U_2, \ldots, U_s$, contain white and black balls in such relative numbers that $p_1, p_2, \ldots, p_s$ are the probabilities, corresponding to the respective urns, that a ball to be drawn will be white. Let

(1) $sp = p_1+p_2+\cdots+p_s$.

From (1) it follows that the mathematical expectation $sp$ of white balls in a set of $s$ obtained one from each urn is exactly equal to the mathematical expectation of white balls in drawing $s$ balls with a constant probability $p$ of success.
The standard deviation $\sigma_P$ of the theoretical distribution of the number of white balls per set of $s$ is related to the standard deviation $\sigma_B=(spq)^{1/2}$ of a hypothetical Bernoulli distribution with a constant probability $p$ of success by the equation

(2) $\sigma_P^2 = spq - \sum_{t=1}^{s}(p_t-p)^2 = \sigma_B^2 - \sum_{t=1}^{s}(p_t-p)^2$,

where $p$ is equal to the mean value of $p_1, p_2, \ldots, p_s$. To prove this we start with (1) and recall that $sp$ is the arithmetic mean of the number of white balls in any set of $s$ under the theoretical distribution.

Let us consider next the standard deviation $\sigma$ of white balls in the theoretical series of $s$ balls. The square of the standard deviation of the frequency of white balls in drawing a single ball with the chance $p_t$ that it will be white is given by $\sigma_t^2 = p_tq_t$; that is, by making $s=1$ in $sp_tq_t$. When the probabilities $p_1, p_2, \ldots, p_s$ are independent of one another, it follows from (16), page 137, that

$\sigma^2 = \sigma_1^2+\sigma_2^2+\cdots+\sigma_s^2$,

where $\sigma_1, \sigma_2, \ldots, \sigma_s$ are the standard deviations of white balls in drawing one ball from each urn corresponding to probabilities $p_1, p_2, \ldots, p_s$, respectively, and $\sigma$ is the standard deviation of white balls among the $s$ balls together drawn one from each urn. Hence, we have

(3) $\sigma^2 = p_1q_1+p_2q_2+\cdots+p_sq_s = \sum_{t=1}^{s}p_tq_t$.

But

$p_t = p+(p_t-p)$, $q_t = q-(p_t-p)$,

and hence

$p_tq_t = pq-(p_t-p)(p-q)-(p_t-p)^2$

and

(4) $\sum_{t=1}^{s}p_tq_t = spq-\sum_{t=1}^{s}(p_t-p)^2$, since $\sum_{t=1}^{s}(p_t-p)=0$.

Hence, we have established (2), from which it follows at once that the standard deviation of a Poisson series is less than that of the corresponding Bernoulli series with constant probability of success equal to the arithmetic mean of the variable probabilities of success. To give an illustration of a Poisson series, conceive of n populated districts.
Each district is to consist of $s$ subdivisions for which the probability of death at a given age varies from one subdivision to another, but in which the series of $s$ probabilities is identical from district to district. To illustrate further this type of distribution, construct an urn schema consisting of 10 urns, each of which contains 15 balls, and in which the number of white balls in the respective urns is 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. The arithmetic mean of the probabilities of drawing a white ball is 1/2. A set of 10 is obtained by drawing one ball from each urn. Then each ball is returned to the urn from which it was drawn, and a second set of 10 is drawn. This process is continued until we have 1,000 sets of 10. The resulting frequency distribution of the number of white balls is a Poisson distribution.

53. Lexis series. To give a statistical illustration of a Lexis series, conceive of $n$ populated districts in each of which the probability of death is constant for men of given age, but is variable from district to district. To develop the theory of the Lexis distribution, we draw $s$ balls one at a time from an urn $U_1$ with a constant probability $p_1$ of getting a white ball, from $U_2$ with a constant probability $p_2$, $\ldots$, from $U_n$ with a constant probability $p_n$. The mathematical expectation of white balls in thus drawing $ns$ balls is

$sp_1+sp_2+\cdots+sp_n = nsp$,

where $p=\frac{1}{n}(p_1+p_2+\cdots+p_n)$ is the arithmetic mean of the probabilities $p_1, p_2, \ldots, p_n$. Since $nsp$ is the mathematical expectation of white balls in samples of $ns$ balls, the mathematical expectation in samples of $s$ balls drawn one at a time from a random urn is $sp$. This value $sp$ is identical to the mathematical expectation of white balls in samples of $s$ balls of a Bernoulli series with a constant probability $p$.
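The urn schema of § 52 lends itself to direct simulation; the sketch below (with the number of sets enlarged beyond 1,000 to steady the estimate) checks that the mean number of white balls is $sp=5$ and that the dispersion agrees with formula (2) rather than with the Bernoulli value $spq$:

```python
import random

# Simulation of the 10-urn schema: urns hold 3,4,...,12 white balls
# out of 15; one ball is drawn from each urn per set.
random.seed(4)
probs = [k / 15 for k in range(3, 13)]
s = len(probs)
p_bar = sum(probs) / s          # = 1/2
q_bar = 1 - p_bar

counts = []
for _ in range(20_000):
    counts.append(sum(random.random() < p for p in probs))

m = sum(counts) / len(counts)
emp_var = sum((c - m) ** 2 for c in counts) / len(counts)

bernoulli_var = s * p_bar * q_bar                             # spq = 2.5
poisson_var = bernoulli_var - sum((p - p_bar) ** 2 for p in probs)
print(m, emp_var, poisson_var)
assert emp_var < bernoulli_var
assert abs(emp_var - poisson_var) < 0.1
```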
Since $p_t$ is the probability that a ball to be drawn from urn $U_t$ will be white, the expected value of the square of the standard deviation of the number of white balls in samples of $s$ drawn from $U_t$ is $sp_tq_t$. In other words, $sp_tq_t$ is the mean square of the deviations of white balls from $sp_t$ in samples of $s$ drawn from $U_t$. If the deviations were measured from $sp$ instead of $sp_t$, it follows from the theorem (p. 21) for changing the origin or axis of second moments that the mean square of the deviations would be

(5) $sp_tq_t+(sp_t-sp)^2$.

Suppose this mean value of the squares of deviations were obtained from $N$ samples of $s$ each. Then

(6) $Nsp_tq_t+Ns^2(p_t-p)^2$

would be the expected value of the sum of squares of the deviations from $sp$ in the $N$ samples of $s$ drawn from $U_t$. By adding together the expression (6) for $t=1, 2, \ldots, n$, we have

(7) $Ns\sum_{t=1}^{n}p_tq_t+Ns^2\sum_{t=1}^{n}(p_t-p)^2$

for the expected value of the sum of squares of the deviations from $sp$ for the $n$ urns. In obtaining (7), we have drawn in all $Nn$ sets of $s$ balls, of which $N$ sets are from each urn. The mean-square deviation from $sp$ of the number of white balls in samples of $s$ thus taken from the $n$ urns $U_1, U_2, \ldots, U_n$ is then obtained by dividing (7) by the number of sets $Nn$. This gives

$\sigma_L^2=\frac{s}{n}\sum_{t=1}^{n}p_tq_t+\frac{s^2}{n}\sum_{t=1}^{n}(p_t-p)^2$.

From (4) above,

$\sum_{t=1}^{n}p_tq_t = npq-\sum_{t=1}^{n}(p_t-p)^2$,

and hence

(8) $\sigma_L^2 = spq+\frac{s(s-1)}{n}\sum_{t=1}^{n}(p_t-p)^2$.

It should be observed from (8) that the standard deviation of a Lexis distribution is greater than that of a Bernoulli distribution based on a constant probability $p$ which is equal to the mean value of the given probabilities $p_1, p_2, \ldots, p_n$.

54. The Lexis ratio. Let $\sigma'$ be the standard deviation of a series of relative frequencies obtained by experiment from statistical data. On the hypothesis of a Bernoulli distribution the theoretical value of the standard deviation is $\sigma'_B=(pq/s)^{1/2}$, where $p$ is the probability of success in any single trial. The ratio

$L=\frac{\sigma'}{\sigma'_B}=\frac{\sigma}{\sigma_B}$

is called the Lexis ratio, where $\sigma=s\sigma'$ and $\sigma_B=s\sigma'_B$.
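Relation (8) may likewise be checked by simulating the Lexis urn scheme of § 53. The probabilities, sample size, and number of samples below are arbitrary illustrative choices:

```python
import random

# Lexis urn scheme: n urns with constant probabilities p_1,...,p_n;
# s balls drawn (with replacement) from each urn, N samples per urn.
# Relation (8): sigma_L^2 = spq + s(s-1)/n * sum (p_t - p)^2 > spq.
random.seed(5)
probs = [0.3, 0.4, 0.5, 0.6, 0.7]            # hypothetical p_t
n, s, N = len(probs), 50, 4_000

p_bar = sum(probs) / n
q_bar = 1 - p_bar
sum_d2 = sum((p - p_bar) ** 2 for p in probs)

counts = [sum(random.random() < p for _ in range(s))
          for p in probs for _ in range(N)]
msd = sum((c - s * p_bar) ** 2 for c in counts) / len(counts)

bernoulli_var = s * p_bar * q_bar                        # spq
lexis_var = bernoulli_var + s * (s - 1) / n * sum_d2     # formula (8)
print(msd, lexis_var, bernoulli_var)
assert msd > bernoulli_var
assert abs(msd - lexis_var) / lexis_var < 0.1
```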
When $L=1$, the series of relative frequencies is said to have normal dispersion. When $L<1$, the series is said to have subnormal dispersion. When $L>1$, the series is said to have supernormal dispersion. Illustrative applications of the Lexis ratio to statistical data are readily available.

From the nature of the Lexis theory it is fairly obvious, as implied in the introduction to this chapter, that the application of the theory to particular statistical data involves breaking up the aggregate into a number of subsets according to some appropriate scheme of classification, which would ordinarily depend on much knowledge of the material which is the subject of the investigation. Then we are concerned not only with a frequency ratio for the entire aggregate, but also with the stability of frequency ratios among the subsets. The dispersion of frequency ratios is calculated and compared with the expected value in the case of a Bernoulli distribution.

As an example, let us consider the dispersion of death-rates of white infants under one year of age in registration states of the United States in which the number of births per year of white children is between 33,000 and 67,000 (see Table I). This restriction is placed on the selection of states so that the number of instances per set has only a moderate amount of variability.

TABLE I

State              Births    Deaths per 1,000
California         65,457        66
Connecticut        33,471        72
Indiana            66,544        70
Kansas             40,477        61
Kentucky           62,941        58
Minnesota          57,185        58
North Carolina     61,348        66
Virginia           48,535        68
Wisconsin          61,352        72
Arithmetic mean    55,257        65.7

In most of the practical problems of statistics the exact values of the underlying probabilities are unknown, and the best substitutes available are the approximate values of the probabilities given by available relative frequencies.
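The computation of the Lexis ratio for the data of Table I may be sketched as follows (the rates and births are those of the table; $s$ is taken as the mean number of births, as in the text):

```python
# Lexis ratio for the death-rates of Table I (per 1,000 births).
births = [65457, 33471, 66544, 40477, 62941, 57185, 61348, 48535, 61352]
rates = [66, 72, 70, 61, 58, 58, 66, 68, 72]

n = len(rates)
mean_rate = sum(rates) / n                                  # ~65.7
sd_rate = (sum((r - mean_rate) ** 2 for r in rates) / n) ** 0.5  # ~5.21

p = mean_rate / 1000
q = 1 - p
s = sum(births) / n                                         # ~55,257
sigma_B = (p * q / s) ** 0.5 * 1000                         # ~1.05 per 1,000

L = sd_rate / sigma_B
print(f"sd = {sd_rate:.2f}, sigma_B = {sigma_B:.2f}, L = {L:.2f}")
assert L > 1                                 # supernormal dispersion
```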
Substituting these frequency ratios as approximations for $p$ and $q$, we find the Bernoulli standard deviation from the formula $\sigma'_B=(pq/s)^{1/2}$. We then compare $\sigma'_B$ with the standard deviation obtained directly from the data. The simple arithmetic mean of the death-rates is 65.7 per 1,000, and their standard deviation (without weighting) is 5.21 per 1,000. If these infantile death-rates constituted a Bernoulli distribution with a number of instances equal to the average number of births, 55,257 in each case, we should have

$\sigma'_B=\left(\frac{pq}{s}\right)^{1/2}=\left[\frac{(.0657)(.9343)}{55{,}257}\right]^{1/2}=.00105$ per person $=1.05$ per 1,000.

Hence, the Lexis ratio is

$L=\frac{5.21}{1.05}=4.96$.

Hence the dispersion is supernormal, and we have strong support for the inference that there is a significant variation in infant mortality from one of these states to another. The full interpretation of this fact would require much knowledge of the sources of the material.

A reasonable plan for the determination of the maximum district over which the infantile death-rates are essentially constant seems to involve breaking the aggregate of instances into subsets in a variety of ways and then testing results as above. Some measure of doubt will remain, but this procedure encourages the kind of analysis that gives strong support to induction.

CHAPTER VII

A DEVELOPMENT OF THE GRAM-CHARLIER SERIES

55. Introduction. In § 56 we shall attempt to show (cf. p. 65) that a certain line of development of the binomial distribution suggests the use of the Gram-Charlier Type A series as a natural extension of the De Moivre-Laplace approximation, and the Type B series as a natural extension of the Poisson exponential approximation considered in Chapter II. Then in §§ 57-58 we shall develop methods for the determination of the parameters in terms of moments of the observed frequency distribution, thus deriving certain results stated without proof in § 19 and § 21.

56.
On a development of Type A and Type B from the law of repeated trials. As in the De Moivre-Laplace theory, we consider the probability that in a sample of $s$ individuals, taken at random from an unlimited supply, $r$ individuals will have a certain attribute. That is, the probability we wish to represent is given by

$B(r)=\binom{s}{r}p^rq^{s-r}$,

and we shall use a function of the form

(1) $B_0(x)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\theta(w)e^{-xwi}\,dw$

for interpolation between the values $B(r)$, where $i^2=-1$ and

(2) $\theta(w)=(pe^{wi}+q)^s=\sum_{r=0}^{s}B(r)e^{rwi}$.

In the terminology of Laplace, $\theta(w)$ is the generating function of the sequence $B(r)$.

We shall first show that $B_0(x)=B(m)$ when $x$ is a positive integer $m$. To prove this, substitute $\theta(w)$ from (2) in (1) and integrate. This gives

$B_0(x)=\sum_{r=0}^{s}B(r)\frac{\sin(r-x)\pi}{(r-x)\pi}$.

When $x=m$ is a positive integer, each term but one of the right member vanishes, and this one has the value $B(m)$. Accordingly, $B_0(m)=B(m)$. Thus formula (1) gives exactly the terms of the expansion of $(p+q)^s$ for positive integral values $x=m$. It may be considered an interpolation formula for values of $x$ between the integral values.

We shall be interested in two developments of this interpolation formula. The first is based on the development of $\log\theta(w)$ in powers of $w$, and the second on the development in powers of $p$. The resulting types of development are known as the Type A and Type B series, respectively.

DEVELOPMENT OF TYPE A

From the form of $\theta(w)$ in (2), we have

(3) $\frac{d\log\theta(w)}{dw}=\frac{ispe^{wi}}{pe^{wi}+q}$.

Develop the right-hand member of (3) in powers of $w$, and we obtain, by integration, remembering that $\theta(0)=1$,

$\log\theta(w)=s\left[pwi+\frac{pq(wi)^2}{2!}-\frac{pq(p-q)(wi)^3}{3!}+\cdots\right]$,

or, writing
(4) $\theta(w)=e^{b_1wi+b_2\frac{(wi)^2}{2!}+b_3\frac{(wi)^3}{3!}+\cdots}$,

we have

(5) $b_1=sp$, $b_2=spq$, $b_3=-spq(p-q)$, $\ldots$

We now write

(6) $\theta(w)=e^{b_1wi-b_2w^2/2}\left[1-A_3(wi)^3+A_4(wi)^4-\cdots\right]$.

Since it follows from (2) that $\theta(w)$ is an entire function of $w$, the series in brackets in the right member of (6) converges, since it is the quotient of an entire function $\theta(w)$ by an exponential factor with no singularities in the finite part of the plane. From (4), (5), and (6) we have

(7) $A_3=\tfrac{1}{6}spq(p-q)$, $A_4=\tfrac{1}{24}spq(1-6pq)$, $\ldots$

Inserting $\theta(w)$ from (6) in (1), we have

(8) $B_0(x)=\frac{1}{2\pi}\int_{-\pi}^{\pi}dw\,e^{-(x-b_1)wi-b_2w^2/2}\left[1-A_3(wi)^3+A_4(wi)^4-\cdots\right]$.

If we write

(9) $\Omega(x)=\frac{1}{2\pi}\int_{-\pi}^{\pi}dw\,e^{-(x-b_1)wi-b_2w^2/2}$,

we have from (8)

(10) $B_0(x)=\Omega(x)+A_3\frac{d^3\Omega}{dx^3}+A_4\frac{d^4\Omega}{dx^4}+\cdots$.

If, however, $b_2$ is not small, we may use in place of $\Omega(x)$ the function $\phi(x)$ defined as

$\phi(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}dw\,e^{-(x-b_1)wi-b_2w^2/2}$

by changing the limits of integration from $\pm\pi$ to $\pm\infty$. Moreover, we shall prove that

(11) $\phi(x)=\frac{1}{(2\pi b_2)^{1/2}}e^{-(x-b_1)^2/2b_2}$.

To prove this, we write

$\phi(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-b_2w^2/2}\cos[w(x-b_1)]\,dw-\frac{i}{2\pi}\int_{-\infty}^{\infty}e^{-b_2w^2/2}\sin[w(x-b_1)]\,dw$.

The second term vanishes because the sine is an odd function. Since the cosine is an even function, we may write

(12) $\phi(x)=\frac{1}{\pi}\int_{0}^{\infty}e^{-b_2w^2/2}\cos[w(x-b_1)]\,dw$.

Differentiation with regard to $x$ gives

$\frac{d\phi}{dx}=-\frac{1}{\pi}\int_{0}^{\infty}e^{-b_2w^2/2}\,w\sin[w(x-b_1)]\,dw$.

Integrate the right-hand member by parts and we have

$\frac{d\phi}{dx}=-\frac{x-b_1}{b_2}\,\phi(x)$.

Then by integration,

(13) $\phi(x)=Ae^{-(x-b_1)^2/2b_2}$.

To find $A$, let $x=b_1$ in (12) and (13). This gives the well-known definite integral

$A=\frac{1}{\pi}\int_{0}^{\infty}e^{-b_2w^2/2}\,dw=\frac{1}{(2\pi b_2)^{1/2}}$.

Hence we have

$\phi(x)=\frac{1}{(2\pi b_2)^{1/2}}e^{-(x-b_1)^2/2b_2}$,

as given in (11). Therefore we may write in place of (10) the Type A series

(14) $B_0(x)=\phi(x)+A_3\frac{d^3\phi}{dx^3}+A_4\frac{d^4\phi}{dx^4}+\cdots$,

where

$\phi(x)=\frac{1}{\sigma(2\pi)^{1/2}}e^{-(x-b_1)^2/2\sigma^2}$, if $\sigma^2=b_2$.

To study the degree of approximation secured in changing the limits of integration from $\pm\pi$ to $\pm\infty$ in passing from $\Omega(x)$ to $\phi(x)$, we observe that

$\phi(x)-\Omega(x)=\frac{1}{\pi}\int_{\pi}^{\infty}dw\,e^{-\sigma^2w^2/2}\cos[w(x-b_1)]$,

and hence

$|\phi(x)-\Omega(x)|<\frac{1}{\pi}\int_{\pi}^{\infty}dw\,e^{-\sigma^2w^2/2}=\frac{1}{\pi\sigma}\int_{\sigma\pi}^{\infty}e^{-\lambda^2/2}\,d\lambda$, if $\lambda^2=\sigma^2w^2$.
Hence the difference approaches zero very rapidly with increasing values of $\sigma$, as may be seen by using the values of the last integral corresponding to $\lambda=1, 2, 3, 4, \ldots$ in a table of this probability integral. A similar examination for the derivatives of $\Omega$ and $\phi$ will show that their differences similarly approach zero.

DEVELOPMENT OF TYPE B

To develop (2) in powers of $p$, we first write, since $\theta(w)=[1-p(1-e^{wi})]^s$,

(15) $\frac{d\log\theta(w)}{dw}=\frac{ispe^{wi}}{1-p(1-e^{wi})}$,

a convergent series, since $|p(1-e^{wi})|<1$. Since $\theta(0)=1$, we obtain by integration

(16) $\log\theta(w)=-s\left[p(1-e^{wi})+\frac{p^2(1-e^{wi})^2}{2}+\frac{p^3(1-e^{wi})^3}{3}+\cdots\right]$.

Hence, writing

(17) $\theta(w)=e^{-sp(1-e^{wi})}\left[1+B_2(1-e^{wi})^2+B_3(1-e^{wi})^3+\cdots\right]$,

we have

$B_2=-\frac{sp^2}{2}$, $B_3=-\frac{sp^3}{3}$, $\ldots$

Now, from (1) and (17),

(18) $B_0(x)=\frac{1}{2\pi}\int_{-\pi}^{\pi}dw\,e^{-xwi-sp(1-e^{wi})}\left[1+B_2(1-e^{wi})^2+B_3(1-e^{wi})^3+\cdots\right]$.

Let

$\psi(x)=\frac{1}{2\pi}\int_{-\pi}^{\pi}dw\,e^{-xwi-sp(1-e^{wi})}=\int_{-\pi}^{\pi}Q(w,x)\,dw$.

Then let

$\Delta\psi(x)=\psi(x)-\psi(x-1)=\int_{-\pi}^{\pi}(1-e^{wi})Q(w,x)\,dw$.

Then

$\Delta^2\psi(x)=\int_{-\pi}^{\pi}(1-e^{wi})^2Q(w,x)\,dw$, $\Delta^3\psi(x)=\int_{-\pi}^{\pi}(1-e^{wi})^3Q(w,x)\,dw$, $\ldots$

Hence, we have

(19) $B_0(x)=\psi(x)+B_2\Delta^2\psi(x)+B_3\Delta^3\psi(x)+\cdots$.

To give other forms to

$\psi(x)=\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-xwi-sp(1-e^{wi})}\,dw$,

we may write

$\psi(x)=\frac{e^{-sp}}{2\pi}\int_{-\pi}^{\pi}e^{-xwi}\left[1+spe^{wi}+\frac{s^2p^2}{2!}e^{2wi}+\cdots\right]dw$

(20) $=e^{-sp}\left[\frac{\sin\pi x}{\pi x}+\frac{sp}{1!}\,\frac{\sin(x-1)\pi}{(x-1)\pi}+\cdots+\frac{(sp)^r}{r!}\,\frac{\sin(x-r)\pi}{(x-r)\pi}+\cdots\right]$

$=\frac{e^{-sp}\sin\pi x}{\pi}\left[\frac{1}{x}-\frac{sp}{1!\,(x-1)}+\frac{s^2p^2}{2!\,(x-2)}-\cdots\right]$

(21) $=\frac{e^{-\lambda}\sin\pi x}{\pi}\sum_{r=0}^{\infty}\frac{(-1)^r\lambda^r}{(x-r)\,r!}$,

if $sp$ is replaced by $\lambda$.

The foregoing analytical processes can easily be justified by the use of the properties of uniformly convergent series. When $x$ approaches an integer $r$, it is easily seen from (20) that each term approaches zero except the term

$\frac{e^{-sp}(sp)^r}{r!}\,\frac{\sin(x-r)\pi}{(x-r)\pi}$,

and this term has as its limit the Poisson exponential

$\frac{e^{-\lambda}\lambda^r}{r!}$.

The formula (21) may therefore be regarded as defining the Poisson exponential $e^{-\lambda}\lambda^x/x!$ for non-integral values of $x$. The development in series (19) is useful only when $p$ is so small that $sp$ is not large, say $sp\le10$, $s$ being a large number.
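As a numerical sketch of the use of series (19), the first correction term $B_2\Delta^2\psi(x)$, with $B_2=-sp^2/2$, may be compared against the exact binomial terms; the values $s=100$, $p=.03$ are arbitrary illustrative choices:

```python
import math

# Type B series (19) truncated after B2:
#   B(r) ~ psi(r) + B2 * Delta^2 psi(r),
# with psi(x) = e^{-lam} lam^x / x!, lam = sp, B2 = -s p^2 / 2.
s, p = 100, 0.03
lam = s * p

def psi(x):
    return math.exp(-lam) * lam**x / math.factorial(x) if x >= 0 else 0.0

def binom(r):
    return math.comb(s, r) * p**r * (1 - p) ** (s - r)

B2 = -s * p**2 / 2
err_poisson = err_typeB = 0.0
for r in range(0, 15):
    d2 = psi(r) - 2 * psi(r - 1) + psi(r - 2)   # Delta^2 psi at r
    approx = psi(r) + B2 * d2
    err_poisson += abs(binom(r) - psi(r))
    err_typeB += abs(binom(r) - approx)

print(err_poisson, err_typeB)
assert err_typeB < err_poisson
```

The correction term reduces the error of the bare Poisson exponential by roughly an order of magnitude in this range of $sp$.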
In this case, $sp$ is likely to be too small to allow an expansion in a Type A series. Otherwise, the development in Type A is better suited to represent the terms of the binomial series.

While the above demonstration is limited to the representation of the law of probability given by terms of a binomial, Wicksell has gone much further in the paper cited above in showing a line of development which suggests the use of the Gram-Charlier series for the representation of the law of probability given by terms of the hypergeometric series, thus representing the law of probability which gives the basis of the Pearson system of generalized frequency curves. Unfortunately, the demonstration of this extension would require somewhat more space devoted to formal analysis than seems desirable in the present monograph. Hence we merely state the above fact without a demonstration.

57. The values of the coefficients of the Type A series obtained from the biorthogonal property. If in (14) we measure $x$ from the centroid as an origin and in units equal to the standard deviation $\sigma$, we may write in place of (14)

(22) $F(x)=\phi(x)+a_3\phi^{(3)}(x)+a_4\phi^{(4)}(x)+\cdots$,

where

$\phi(x)=\frac{1}{(2\pi)^{1/2}}e^{-x^2/2}$

and $\phi^{(n)}(x)$ is the $n$th derivative of $\phi(x)$ with respect to $x$. The coefficients $a_n$ ($n=0, 3, 4, \ldots$) in the Type A series may be easily expressed in terms of moments of area under the given frequency curve about the centroidal ordinate, because the functions $\phi^{(n)}(x)$ and the Hermite polynomials $H_n(x)$ defined by the equation

$\phi^{(n)}(x)=(-1)^nH_n(x)\phi(x)$

form a biorthogonal system. Thus,

(23) $\int_{-\infty}^{\infty}\phi^{(n)}(x)H_m(x)\,dx=0$ ($m\ne n$)

and

(24) $\int_{-\infty}^{\infty}\phi^{(n)}(x)H_n(x)\,dx=(-1)^n\,n!$ ($m=n$),

and this biorthogonal property affords a simple method of determining the coefficients in the Type A series. To prove (23) and (24) we may write

(25) $\int_{-\infty}^{\infty}\phi^{(n)}(x)H_m(x)\,dx=(-1)^n\int_{-\infty}^{\infty}\phi(x)H_n(x)H_m(x)\,dx=(-1)^{m+n}\int_{-\infty}^{\infty}\phi^{(m)}(x)H_n(x)\,dx$.
Integration by parts gives

$\int_{-\infty}^{\infty}\phi^{(n)}(x)H_m(x)\,dx=\left[\phi^{(n-1)}(x)H_m(x)\right]_{-\infty}^{\infty}-\int_{-\infty}^{\infty}\phi^{(n-1)}(x)H'_m(x)\,dx=-\int_{-\infty}^{\infty}\phi^{(n-1)}(x)H'_m(x)\,dx$.

Continuing until we have performed $m+1$ successive integrations by parts, we obtain, assuming $n>m$,

$\int_{-\infty}^{\infty}\phi^{(n)}(x)H_m(x)\,dx=(-1)^{m+1}\int_{-\infty}^{\infty}\phi^{(n-m-1)}(x)H_m^{(m+1)}(x)\,dx$,

where $H_m^{(m+1)}(x)$ is the $(m+1)$th derivative of $H_m(x)$. Since $H_m(x)$ is a polynomial of degree $m$ in $x$, its $(m+1)$th derivative vanishes, and we have

(26) $\int_{-\infty}^{\infty}\phi^{(n)}(x)H_m(x)\,dx=0$ for $n>m$.

But from the form of (25), it is obvious that we could equally well prove (26) for $m>n$. For $m=n$, we proceed as above with $n$ successive integrations. We then have

$\int_{-\infty}^{\infty}\phi^{(n)}(x)H_n(x)\,dx=(-1)^n\int_{-\infty}^{\infty}\phi(x)H_n^{(n)}(x)\,dx$.

But the $n$th derivative $H_n^{(n)}(x)$ of the polynomial $H_n(x)$ is equal to $n!$. Hence,

(27) $\int_{-\infty}^{\infty}\phi^{(n)}(x)H_n(x)\,dx=(-1)^n\,n!\int_{-\infty}^{\infty}\phi(x)\,dx=(-1)^n\,n!$.

By multiplying both members of (22) by $H_n(x)$ and integrating, under the assumption that the series is uniformly convergent, we have

$\int_{-\infty}^{\infty}F(x)H_n(x)\,dx=a_n\int_{-\infty}^{\infty}\phi^{(n)}(x)H_n(x)\,dx=(-1)^n\,n!\,a_n$,

since by application of (26) all terms of the right-hand member vanish except the one with the coefficient $a_n$. Hence,

(28) $a_n=\frac{(-1)^n}{n!}\int_{-\infty}^{\infty}F(x)H_n(x)\,dx$.

Moreover, to determine $a_n$ numerically for an observed frequency distribution, we replace $F(x)$ in (28) by the observed frequency function $f(x)$.

For purposes of numerical application, let us now change back from the standard deviation as a unit to measuring $x$ in the ordinary unit of measurement (feet, pounds, etc.) involved in the problem, but still keep the origin at the centroid. This means that we replace $x$ in (28) by $x/\sigma$. If in these units $f(x)$ gives the observed frequency distribution, we may write in place of (28)

(29) $a_n=\frac{(-1)^n}{n!}\int_{-\infty}^{\infty}f(x)H_n(x/\sigma)\,dx$.

Since $H_n(x/\sigma)$ is a polynomial of degree $n$ in $x$, the coefficients $a_n$ are thus given in terms of moments of area under the observed frequency curve.
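The biorthogonal relations (26) and (27) underlying formula (28) can be verified by direct quadrature; a small sketch (the interval $(-10, 10)$ and step count are arbitrary choices adequate for small $n$):

```python
import math

# Check of (26)-(27): with phi^{(n)}(x) = (-1)^n H_n(x) phi(x), the
# integral of phi^{(n)} H_m over the line is 0 for m != n and
# (-1)^n n! for m = n.
def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def H(n, x):                      # Hermite polynomials by recurrence
    h0, h1 = 1.0, x               # H_{n+1} = x H_n - n H_{n-1}
    if n == 0:
        return h0
    for k in range(1, n):
        h0, h1 = h1, x * h1 - k * h0
    return h1

def integral(n, m):               # midpoint quadrature over [-10, 10]
    steps, a, b = 20_000, -10.0, 10.0
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx
        total += (-1) ** n * H(n, x) * phi(x) * H(m, x) * dx
    return total

assert abs(integral(3, 4)) < 1e-6
assert abs(integral(4, 3)) < 1e-6
assert abs(integral(3, 3) - (-1) ** 3 * math.factorial(3)) < 1e-6
assert abs(integral(4, 4) - math.factorial(4)) < 1e-6
print("biorthogonality verified for n, m <= 4")
```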
It is then fairly obvious that the determination of the moments of area under the frequency curve plays an important part in the Gram-Charlier system as well as in the Pearson system.

58. The values of the coefficients of Type A series obtained from a least-squares criterion. It may be proved, following J. P. Gram, that the value of any coefficient $a_n$ obtained in § 57 by the use of the biorthogonal property is the same as that obtained by finding the best approximation to $f(x)$, in the sense of a certain least-squares criterion, by the first $m$ terms of the series ($m\ge n$). To prove this statement, we may proceed as follows: Consider the series

$F(x)=a_0\phi(x)+a_3\phi^{(3)}(x)+\cdots+a_m\phi^{(m)}(x)$

for the representation of an observed frequency function $f(x)$. The least-squares criterion that

(30) $V=\int_{-\infty}^{\infty}\frac{1}{\phi(x)}\left[f(x)-F(x)\right]^2dx$

shall be a minimum leads to the values of the coefficients given in § 19. To prove this, we square the binomial $f(x)-F(x)$ and differentiate partially with regard to the parameters $a_0, a_3, \ldots, a_m$. This gives

$\frac{\partial V}{\partial a_n}=-2\frac{\partial}{\partial a_n}\int_{-\infty}^{\infty}\frac{f(x)F(x)}{\phi(x)}\,dx+\frac{\partial}{\partial a_n}\int_{-\infty}^{\infty}\frac{[F(x)]^2}{\phi(x)}\,dx=2(-1)^{n+1}\int_{-\infty}^{\infty}f(x)H_n(x)\,dx+2a_n\int_{-\infty}^{\infty}[H_n(x)]^2\phi(x)\,dx$,

since

$\int_{-\infty}^{\infty}\frac{[F(x)]^2}{\phi(x)}\,dx=\int_{-\infty}^{\infty}\left\{a_0^2[H_0(x)]^2+a_3^2[H_3(x)]^2+\cdots+a_m^2[H_m(x)]^2\right\}\phi(x)\,dx$,

the product terms vanishing because of (26). Making $\partial V/\partial a_n=0$, we have

(31) $2(-1)^{n+1}\int_{-\infty}^{\infty}f(x)H_n(x)\,dx+2a_n\int_{-\infty}^{\infty}[H_n(x)]^2\phi(x)\,dx=0$.

But

(32) $\int_{-\infty}^{\infty}[H_n(x)]^2\phi(x)\,dx=(-1)^n\int_{-\infty}^{\infty}\phi^{(n)}(x)H_n(x)\,dx=n!$.

From (31) and (32), we have

$a_n=\frac{(-1)^n}{n!}\int_{-\infty}^{\infty}f(x)H_n(x)\,dx$,

which is identical with the value obtained by the use of the biorthogonal property.

59. The coefficients of a Type B series. In considering the determination of the coefficients $c_0, c_1, c_2, \ldots$ of the Type B series, we shall restrict our treatment to a distribution of equally distant ordinates at non-negative integral values of $x$, and shall for simplicity consider the representation by the first three terms of the series.
That is, we write

$F(x)=c_0\psi(x)+c_1\Delta\psi(x)+c_2\Delta^2\psi(x)$,

where

$\psi(x)=\frac{e^{-\lambda}\lambda^x}{x!}$ for $x=0, 1, 2, \ldots$

Let $f(x)$ give the ordinates of the observed distribution of relative frequencies, so that $\sum f(x)=1$. Equating sums of ordinates and first and second moments $\mu_1'$ and $\mu_2'$ of ordinates of the theoretical and observed distributions, we may now determine the coefficients approximately from the equations:

(33) $\sum\left[c_0\psi(x)+c_1\Delta\psi(x)+c_2\Delta^2\psi(x)\right]=\sum f(x)=1$,
$\sum x\left[c_0\psi(x)+c_1\Delta\psi(x)+c_2\Delta^2\psi(x)\right]=\sum xf(x)=\mu_1'$,
$\sum x^2\left[c_0\psi(x)+c_1\Delta\psi(x)+c_2\Delta^2\psi(x)\right]=\sum x^2f(x)=\mu_2'$.

Before solving these equations for $c_0$, $c_1$, and $c_2$, we may simplify by the substitution of certain values which are close approximations when we are dealing with large numbers. Thus, we recall that we have derived in § 14, Chapter II, the following approximations:

$\sum x\psi(x)=\lambda$, $\sum x^2\psi(x)=\lambda+\lambda^2$.

We may next easily obtain the following further approximate values:

$\sum x\Delta\psi(x)=\sum x\left[\psi(x)-\psi(x-1)\right]=\lambda-(\lambda+1)=-1$.

Similarly, it is easily shown that

$\sum x\Delta^2\psi(x)=0$, $\sum x^2\Delta\psi(x)=-2\lambda-1$,

and

$\sum x^2\Delta^2\psi(x)=2$.

Substituting these values in equations (33), we obtain

$c_0=1$,
$\lambda c_0-c_1=\mu_1'$,
$(\lambda+\lambda^2)c_0-(2\lambda+1)c_1+2c_2=\mu_2'$.

If we take $\lambda=\mu_1'$, we have the coefficient $c_1=0$. Then, expressing the second moment $\mu_2'$ in terms of the second moment $\mu_2$ about the mean by the relation $\mu_2'=\mu_2+\lambda^2$, we have

$\lambda+\lambda^2+2c_2=\mu_2+\lambda^2$, $c_2=\tfrac{1}{2}(\mu_2-\lambda)$.

Hence, we write

$F(x)=\psi(x)+\tfrac{1}{2}(\mu_2-\lambda)\Delta^2\psi(x)$,

when $\lambda$ is taken equal to the first moment $\mu_1'$, which is the arithmetic mean of the values of the given variates. It is fairly obvious that this application of moments to finding values of the coefficients can be extended to more terms if they were needed in dealing with actual data.

NOTES

1. Page 1. Émile Borel, Éléments de la théorie des probabilités, p. 167; Le hasard, p. 154.

2. Page 23. Julian L. Coolidge, An introduction to mathematical probability (1925), pp. 13-32.

3. Page 30. Tchebycheff, Des valeurs moyennes, Journal de Mathématiques (2), Vol. 12 (1867), pp. 177-84.

4. Page 30. M.
Bienaymé, Considérations à l'appui de la découverte de Laplace sur la loi de probabilité dans la méthode des moindres carrés, Comptes Rendus, Vol. 37 (1853), pp. 309-24.

5. Page 31. E. L. Dodd, The greatest and the least variate under general laws of error, Transactions of the American Mathematical Society, Vol. 25 (1923), pp. 525-39.

6. Pages 31 and 47. Some writers call this theorem the "Bernoulli theorem" and others call it the "Laplace theorem." It has been shown recently by Karl Pearson that most of the credit for the theorem should go to De Moivre rather than to Bernoulli. For this reason we call the theorem the "De Moivre-Laplace theorem" rather than the "Bernoulli-Laplace theorem." See Historical note on the origin of the normal curve of errors, Biometrika, Vol. 16 (1924), p. 402; also James Bernoulli's theorem, Biometrika, Vol. 17 (1925), p. 201.

7. Page 32. For a proof see Coolidge, An introduction to mathematical probability, pp. 38-42.

8. Pages 37 and 68. James W. Glover, Tables of applied mathematics (1923), pp. 392-411.

9. Page 37. Karl Pearson, Tables for statisticians and biometricians (1914), pp. 2-9.

10. Page 39. Poisson, Recherches sur la probabilité des jugements, Paris, 1837, pp. 205 ff.

11. Page 39. Bortkiewicz, Das Gesetz der kleinen Zahlen, Leipzig, 1898.

12. Page 48. For various proofs of the normal law, see David Brunt, The combination of observations (1917), pp. 11-24; also Czuber, Beobachtungsfehler (1891), pp. 48-110.

13. Page 50. Karl Pearson, Mathematical contributions to the theory of evolution, Philosophical Transactions, A, Vol. 186 (1895), pp. 343-414.

14. Page 50. Karl Pearson, Supplement to a memoir on skew variation, Philosophical Transactions, A, Vol. 197 (1901), pp. 443-56.

15. Page 50. Karl Pearson, Second supplement to a memoir on skew variation, Philosophical Transactions, A, Vol. 216 (1916), pp. 429-57.

16. Page 60. J. P.
Gram, Om Rækkeudviklinger (Doctor's dissertation), Copenhagen, 1879; also Über die Entwickelung reeller Functionen in Reihen mittelst der Methode der kleinsten Quadrate, Journal für Mathematik, Vol. 94 (1883), pp. 41-73.
17. Page 60. T. N. Thiele, Almindelig Iagttagelseslære, Copenhagen, 1889; cf. Thiele, Theory of observations, 1903.
18. Page 60. F. Y. Edgeworth, The asymmetrical probability-curve, Philosophical Magazine, Vol. 41 (1896), pp. 90-99; also The law of error, Cambridge Philosophical Transactions, Vol. 20 (1904), pp. 36-65, 113-41.
19. Page 60. G. T. Fechner, Kollektivmasslehre (ed. G. F. Lipps), 1897.
20. Page 60. H. Bruns, Über die Darstellung von Fehlergesetzen, Astronomische Nachrichten, Vol. 143, No. 3429 (1897); also Wahrscheinlichkeitsrechnung und Kollektivmasslehre, 1906.
21. Page 60. C. V. L. Charlier, Über das Fehlergesetz, Arkiv för Matematik, Astronomi och Fysik, Vol. 2, No. 8 (1905), pp. 1-9; also Über die Darstellung willkürlicher Funktionen, Arkiv för Matematik, Astronomi och Fysik, Vol. 2, No. 20 (1905), pp. 1-35.
22. Page 60. V. Romanovsky, Generalization of some types of frequency curves of Professor Pearson, Biometrika, Vol. 16 (1924), pp. 106-17.
23. Page 63. Wera Myller-Lebedeff, Die Theorie der Integralgleichungen in Anwendung auf einige Reihenentwicklungen, Mathematische Annalen, Vol. 64 (1907), pp. 388-416.
24. Pages 65 and 156. S. D. Wicksell, Contributions to the analytical theory of sampling, Arkiv för Matematik, Astronomi och Fysik, Vol. 17, No. 19 (1923), pp. 1-46.
25. Pages 67 and 169. In the use of the least-squares criterion that $V$ in (16), §20, and in (30), §58, shall be a minimum, a question naturally arises as to the propriety of weighting squares of deviations with the reciprocal $1/\phi(x) = (2\pi)^{1/2}e^{x^2/2}$ of the normal function. Gram used this weighting without commenting on its propriety so far as the writer has been able to learn.
One fairly obvious point in support of the weighting is its algebraic convenience.
26. Page 68. N. R. Jørgensen, Undersøgelser over Frequensflader og Korrelation (1916), pp. 178-93.
27. Page 74. H. L. Rietz, Frequency distributions obtained by certain transformations of normally distributed variates, Annals of Mathematics, Vol. 23 (1922), pp. 292-300.
28. Page 74. S. D. Wicksell, On the genetic theory of frequency, Arkiv för Matematik, Astronomi och Fysik, Vol. 12, No. 20 (1917), pp. 1-56.
29. Page 75. E. L. Dodd, The frequency law of a function of variables with given frequency laws, Annals of Mathematics, Ser. 2, Vol. 27, No. 1 (1925), pp. 12-20.
30. Page 75. S. Bernstein, Sur les courbes de distribution des probabilités, Mathematische Zeitschrift, Vol. 24 (1925), pp. 199-211.
31. Page 81. Francis Galton, Proceedings of the Royal Society, Vol. 40 (1886), Appendix by J. D. Hamilton Dickson, p. 63.
32. Page 81. Karl Pearson, Mathematical contributions to the theory of evolution III, Philosophical Transactions, A, Vol. 187 (1896), pp. 253-318.
33. Page 81. G. Udny Yule, On the significance of Bravais's formulae for regression, etc., Proceedings of the Royal Society, Vol. 60 (1897), pp. 477-89.
34. Page 84. E. V. Huntington, Mathematics and statistics, American Mathematical Monthly, Vol. 26 (1919), p. 424.
35. Page 86. H. L. Rietz, On functional relations for which the coefficient of correlation is zero, Quarterly Publications of the American Statistical Association, Vol. 16 (1919), pp. 472-76.
36. Page 91. Karl Pearson, On a correction to be made to the correlation ratio, Biometrika, Vol. 8 (1911-12), pp. 254-56; see also Student, The correction to be made to the correlation ratio for grouping, Biometrika, Vol. 9 (1913), pp. 316-20.
37. Page 92. Karl Pearson, On a general method of determining the successive terms in a skew regression line, Biometrika, Vol. 13 (1920-21), pp. 296-300.
38. Page 100.
Maxime Bôcher, Introduction to higher algebra (1912), p. 33.
39. Page 101. L. Isserlis, On the partial correlation ratio, Biometrika, Vol. 10 (1914-15), pp. 391-411.
40. Page 101. Karl Pearson, On the partial correlation ratio, Proceedings of the Royal Society, A, Vol. 91 (1914-15), pp. 492-98.
41. Page 102. H. L. Rietz, Urn schemata as a basis for the development of correlation theory, Annals of Mathematics, Vol. 21 (1920), pp. 306-22.
42. Page 103. A. A. Tschuprow, Grundbegriffe und Grundprobleme der Korrelationstheorie (1925).
43. Page 109. H. L. Rietz, On the theory of correlation with special reference to certain significant loci on the plane of distribution, Annals of Mathematics (Second Series), Vol. 13 (1912), pp. 195-96.
44. Pages 111 and 112. Karl Pearson, On the theory of contingency and its relation to association and normal correlation, Drapers' Company Research Memoirs (Biometric Series I) (1904), p. 10.
45. Page 111. E. Czuber, Theorie der Beobachtungsfehler (1891), pp. 355-82.
46. Page 111. James McMahon, Hyperspherical goniometry; and its application to correlation theory for N variables, Biometrika, Vol. 15 (1923), pp. 192-208; paper edited by F. W. Owens after the death of Professor McMahon.
47. Page 112. Karl Pearson, On the correlation of characters not quantitatively measurable, Philosophical Transactions, A, Vol. 195 (1900), pp. 1-47.
48. Page 112. Karl Pearson, On further methods of determining correlation, Drapers' Company Research Memoirs (Biometric Series IV) (1907), pp. 10-18.
49. Page 112. Warren M. Persons, Correlation of time series (Handbook of Mathematical Statistics [1924], pp. 150-65).
50. Page 112. C. Gini, Nuovi contributi alla teoria delle relazioni statistiche, Atti del R. Istituto Veneto di S.L.A., Tome 74, P. II (1914-15).
51. Page 112. Louis Bachelier, Calcul des probabilités (1912), chaps. 17 and 18.
52. Page 113.
Seimatsu Narumi, On the general forms of bivariate frequency distributions which are mathematically possible when regression and variation are subjected to limiting conditions, Biometrika, Vol. 15 (1923), pp. 77-88, 209-21.
53. Page 113. Karl Pearson, Notes on skew frequency surfaces, Biometrika, Vol. 15 (1923), pp. 222-44.
54. Page 113. Burton H. Camp, Mutually consistent multiple regression surfaces, Biometrika, Vol. 17 (1925), pp. 443-58.
55. Page 126. G. Bohlmann, Formulierung und Begründung zweier Hilfssätze der mathematischen Statistik, Mathematische Annalen, Vol. 74 (1913), pp. 341-409.
56. Page 134. Burton H. Camp, Problems in sampling, Journal of the American Statistical Association, Vol. 18 (1923), pp. 964-77.
57. Page 138. Student, The probable error of a mean, Biometrika, Vol. 6 (1908-9), pp. 1-25.
58. Page 138. Karl Pearson, On the distribution of the standard deviations of small samples: Appendix I. To papers by "Student" and R. A. Fisher, Biometrika, Vol. 10 (1914-15), pp. 522-29.
59. Page 139. R. A. Fisher, Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population, Biometrika, Vol. 10 (1914-15), pp. 507-21.
60. Page 139. H. E. Soper and Others, On the distribution of the correlation coefficient in small samples, Appendix II to the papers of "Student" and R. A. Fisher, Biometrika, Vol. 11 (1915-17), pp. 328-413.
61. Page 142. Karl Pearson, On generalised Tchebycheff theorems in the mathematical theory of statistics, Biometrika, Vol. 12 (1918-19), pp. 284-96.
62. Page 143. M. Alf. Guldberg, Sur le théorème de M. Tchébychef, Comptes Rendus, Vol. 175 (1922), p. 418; also Sur quelques inégalités dans le calcul des probabilités, Vol. 175 (1922), p. 1382. M. Birger Meidell, Sur un problème du calcul des probabilités et les statistiques mathématiques, Comptes Rendus, Vol. 175 (1922), p. 806; also Sur la probabilité des erreurs, Comptes Rendus, Vol. 176 (1923), p. 280. B.
H. Camp, A new generalization of Tchebycheff's statistical inequality, Bulletin of the American Mathematical Society, Vol. 28 (1922), pp. 427-32. Seimatsu Narumi, On further inequalities with possible applications to problems in the theory of probability, Biometrika, Vol. 15 (1923), p. 245.
63. Page 147. W. Lexis, Über die Theorie der Stabilität statistischer Reihen, Jahrbücher für Nationalökonomie und Statistik, Vol. 32 (1879), pp. 60-98; Abhandlungen zur Theorie der Bevölkerungs- und Moralstatistik, Kap. V-IX (1903).
64. Page 147. C. V. L. Charlier, Vorlesungen über die Grundzüge der mathematischen Statistik (1920), p. 5.
65. Page 147. J. M. Keynes, A treatise on probability (1921), p. 393.
66. Page 153. In this connection, the expression "L = 1" means "L = 1 apart from chance fluctuations."
67. Page 153. Handbook of Mathematical Statistics (1924), pp. 88-91; C. V. L. Charlier, Vorlesungen über die Grundzüge der mathematischen Statistik (1920), pp. 38-42.
68. Page 154. Birth statistics for the registration area of the United States (1921), p. 37.

INDEX
(Numbers refer to pages)

Arithmetic mean and mathematical expectation, 14-16
Bachelier, 112
Bernoulli, 2; distribution, 23; theorem of, 27-31; series, 146
Bernstein, 75
Bertrand, 109
Bielfeld, J. F.
von, 2
Bienaymé-Tchebycheff criterion, 28-29; generalization of, 140-44
Binomial distributions, 22-27, 51
Bôcher, 175
Bohlmann, 126
Borel, 1
Bortkiewicz, 39
Bravais, 3
Bruns, 49, 60
Brunt, 173
Camp, 113, 134, 143, 144
Carver, 75
Cattell and Brimhall, 38
Charlier, 2, 49, 60-67, 156-77
Coefficient of alienation, 87
Coolidge, 23, 173
Correlation, 77-113; meaning of, 77-78; regression method, 78-103; correlation surface method, 79, 104-11; correlation coefficient, 82; linear regression, 84; non-linear regression, 88; correlation ratio, 88-91; multiple, 92-102; partial, 98-101; standard deviation of arrays (standard error of estimate), 87-90, 95; multiple correlation coefficient, 97; partial correlation coefficient, 98-99; multiple correlation ratio, 101; normal correlation surfaces, 104-11; of errors, 122-24
Czuber, 173
De Moivre, 2, 3
De Moivre-Laplace theory, 31-38, 43-45, 156
Deviation: quartile, 38; standard, 27, 47
Dickson, J. Hamilton, 81
Discrepancy, 26; relative, 27
Dispersion: normal, 3, 153; subnormal, 3, 153; supernormal, 4, 153; measures of, 14, 27, 153
Dodd, 31, 75
Edgeworth, 2, 49, 60, 73
Elderton, 53, 60
Ellipse of maximum probability, 109
Error; see Probable error and Standard error
Euler, 3
Excess, 71-72
Fechner, 49, 60
Fisher, R.
A., 139
Frequency, relative, 6
Frequency curves: defined, 13; normal, 34, 47; generalized, 48-76
Frequency distribution, observed and theoretical, 12-14
Frequency functions: defined, 13; of one variable, 46-76; normal, 34, 47; generalized, 48-76
Galton, 81-82
Gauss, 3, 47
Generating function, 60, 75, 76, 157
Gini, 112
Glover, tables of applied mathematics, 37, 68
Gram, 2, 49, 60
Gram-Charlier series, 60, 61, 65, 72, 75-76; development of, 156-77; coefficients of, 65-68, 165-70
Guldberg, 143
Hermite polynomials, 66, 75-76, 165
Heteroscedastic system, 88
Homoscedastic system, 88
Huntington, 175
Hypergeometric series, 52, 165
Isserlis, 101
Jacobi polynomials, 75
Jørgensen, 68
Laguerre polynomials, 76
Laplace, 2, 3; see De Moivre-Laplace theory
Lexis, 2, 3; theory, 146-55; series, 146, 150; ratio, 152
Maclaurin, 3
McMahon, 111
Mathematical expectation, 14-16, 116-17; of the power of a variable, 18-21; of successes, 26
Median, 134; standard error in, 135
Meidell, 143-44
Mode and most probable value, 17-18, 25
Moments: defined, 18; about an arbitrary origin, 18; about the arithmetic mean, 19; applied to Pearson's system of frequency curves, 58; coefficients of Gram-Charlier series in terms of, 66, 168, 170
Most probable number of successes, 25
Multiple correlation coefficient, 97
Myller-Lebedeff, 174
Narumi, 113, 143
Normal correlation surfaces, 104-11
Normal frequency curve, 34, 47, 50; generalized, 60, 61, 65-69, 156-61
Partial correlation coefficient, 98-99
Pearson, 2, 47, 49; generalized frequency curves, 50-60, 75, 81, 92, 101, 111-13, 138, 142, 144-45, 175
Pearson's system of frequency curves, 50-60, 75
Persons, 176
Poisson, 3; exponential function, 39-45, 61, 164; series, 148
Population, 4
Probability: meaning of, 6-11; a priori, 10; a posteriori, 10; statistical, 10
Probable ellipse, 108
Probable error, 39, 127-30, 132
Quartile deviation, 39
Quetelet, 3
Random sampling fluctuations, 114-45
Regression: curve defined, 79; linear, 84,
93; non-linear, 88-91, 101; surface defined, 93; method of, 78-103
Relative frequency and probability, 6-11
Rietz, 174, 175
Romanovsky, 60, 75
Scedastic curve, 87
Sheppard, 37
Simple sampling, 22-44, 114-36
Skewness, 68-71
Small samples, 138-40
Soper, Young, Cave, Lee, and Pearson, 139
Standard deviation, 27, 82; of random sampling fluctuations, 116; of sum of independent variables, 136; see Standard error
Standard error: defined, 119; in class frequencies, 119-21; in arithmetic mean, 126-27; in qth moment, 130-32; in median, 134; in averages from small samples, 138-40
Stirling, 3, 32
Student, 138
Tchebycheff, 2, 28-30, 140-44
Thiele, 49, 60
Tschuprow, 102
Whittaker, Lucy, 43
Whittaker and Robinson, 64
Wicksell, 65, 74, 164
Yule, 81
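As a supplement to the derivation of the Type B coefficients that closes the chapter text above, the fitting procedure can be sketched in modern code. The formula $F(x) = \psi(x) + \tfrac{1}{2}(\mu_2-\lambda)\Delta^2\psi(x)$, with $\lambda$ taken equal to the arithmetic mean, is from the text; the function names and the use of ungrouped variate values are illustrative assumptions, not the author's.

```python
import math

def poisson_pmf(lam, x):
    """psi(x) = e^(-lam) lam^x / x!  for x = 0, 1, 2, ...; zero for x < 0."""
    if x < 0:
        return 0.0
    return math.exp(-lam) * lam**x / math.factorial(x)

def type_b_series(data, x):
    """Two-term Type B series ordinate fitted by moments, as in the text:
    F(x) = psi(x) + (1/2)(mu2 - lam) * Delta^2 psi(x),
    where lam is the arithmetic mean (so c1 = 0) and mu2 is the second
    moment about the mean."""
    n = len(data)
    lam = sum(data) / n                              # lam = mu_1'
    mu2 = sum((v - lam) ** 2 for v in data) / n      # second moment about mean
    # Second backward difference: Delta^2 psi(x) = psi(x) - 2 psi(x-1) + psi(x-2)
    d2 = (poisson_pmf(lam, x)
          - 2 * poisson_pmf(lam, x - 1)
          + poisson_pmf(lam, x - 2))
    return poisson_pmf(lam, x) + 0.5 * (mu2 - lam) * d2
```

Because $\sum\psi(x)=1$ and the difference terms telescope to zero, the fitted ordinates sum to unity; and when the sample variance happens to equal the mean (so $\mu_2=\lambda$), the correction term vanishes and $F(x)$ reduces to the Poisson ordinate $\psi(x)$ itself.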